Abstract

Independence screening is a powerful method for variable selection for 'Big Data' when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or variations of them. In many applications, researchers often have some prior knowledge that a certain set of variables is related to the response. In such a situation, a natural assessment of the relative importance of the other predictors is the conditional contribution of each individual predictor in the presence of the known set of variables. This results in conditional sure independence screening (CSIS). Conditioning helps reduce the false positive and false negative rates in the variable selection process. In this paper, we propose and study CSIS in the context of generalized linear models. For ultrahigh-dimensional statistical problems, we give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situation under which CSIS yields model selection consistency. Moreover, we provide two data-driven methods to select the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and analysis of two real data sets.
1 INTRODUCTION

Statisticians are nowadays frequently confronted with massive data sets from various frontiers of scientific research. Fields such as genomics, neuroscience, finance and earth sciences have different concerns on their subject matters, but nevertheless share a common theme: they rely heavily on extracting useful information from massive data, and the number of covariates p can be huge in comparison with the sample size n. In such a situation, the parameters are identifiable only when the number of predictors that are relevant to the response is small, namely, when the vector of regression coefficients is sparse. This sparsity assumption has the nice interpretation that only a limited number of variables have predictive power on the response. To explore the sparsity, variable selection techniques are needed.



Over the last ten years, there have been many exciting developments in statistics and machine learning on variable selection techniques for ultrahigh dimensional feature space. They can basically be classified into two classes: penalized likelihood and screening. Penalized likelihood techniques are well known in statistics: Bridge regression (Frank and Friedman 1993), Lasso (Tibshirani 1996), SCAD or other folded concave regularization methods (Fan and Li 2001; Fan and Lv 2011; Zhang and Zhang 2012), and Dantzig selector (Candes and Tao 2007; Bickel et al. 2009), among others. These techniques select variables and estimate parameters simultaneously by solving a high-dimensional optimization problem. See Hastie et al. (2009) and Bühlmann and van de Geer (2011) for an overview of the field. Despite the fact that various efficient algorithms have been proposed (Osborne et al. 2000a,b; Efron et al. 2004; Fan and Lv 2011), statisticians and machine learners still face huge computational challenges when the number of variables is in tens of thousands of dimensions or higher. This is particularly the case as we are entering the era of "Big Data" in which both sample size and dimensionality are large.
With this background, Fan and Lv (2008) propose a two-scale approach, called iterative sure independence screening (ISIS), which screens and selects variables iteratively. The approach is further developed by Fan et al. (2009) in the context of generalized linear models. Theoretical properties of sure independence screening for generalized linear models have been thoroughly studied by Fan and Song (2010). Other marginal screening methods include tilting methods (Hall et al. 2009), generalized correlation screening (Hall and Miller 2009), nonparametric screening (Fan et al. 2011), and robust rank correlation based screening (Li et al. 2012), among others. The merits of screening include expediences in distributed computation and implementation. By ranking marginal utility such as marginal correlation with the response, variables with weak marginal utilities are screened out by a simple thresholding.

Simple marginal screening faces a number of challenges. As pointed out in Fan and Lv (2008), it can screen out hidden signature variables: those that have a big impact on the response but are weakly correlated with it. It can also produce many false positives, namely by recruiting variables that have strong marginal utilities but are conditionally independent of the response given other variables. Fan and Lv (2008) and Fan et al. (2009) use a residual based approach to circumvent the problem, but the idea of conditional screening has never been formally developed.

Conditional marginal screening is a natural extension of simple independent screening. In many applications, researchers know from previous investigations that certain variables $\mathbf{X}_{\mathcal{C}}$ are responsible for the outcomes. This knowledge should be taken into account when applying a variable selection technique in order not to remove these predictors from the model and to improve the selection process. Conditional screening recruits additional variables to strengthen the prediction power of $\mathbf{X}_{\mathcal{C}}$, via ranking the conditional marginal utility of each variable in presence of $\mathbf{X}_{\mathcal{C}}$. In absence of such prior knowledge, one can take those variables that survive the screening and selection as in Fan and Lv (2008).



Conditional screening has several advantages. First of all, it makes it possible to recover the hidden significant variables. This can be seen by considering the following linear regression model

$$Y = \mathbf{X}^T \boldsymbol{\beta}^* + \varepsilon, \qquad E\,\mathbf{X}\varepsilon = 0, \tag{1}$$

with $\boldsymbol{\beta}^* = (\beta_1^*, \ldots, \beta_p^*)^T$. The marginal covariance between $X_j$ and $Y$ is given by

$$\mathrm{Cov}(X_j, Y) = \mathrm{Cov}(X_j, \mathbf{X}^T \boldsymbol{\beta}^*) = \mathbf{e}_j^T \Sigma \boldsymbol{\beta}^*,$$

where $\mathbf{e}_j \in \mathbb{R}^p$ is equal to 0, except for its $j$th element, which equals 1. This shows that the marginal covariance between $X_j$ and $Y$ is zero if $\beta_j^* = -\sum_{k \neq j} \beta_k^* \sigma_{kj}$, where $\sigma_{kj}$ is the $(k, j)$ element of $\Sigma = \mathrm{Var}(\mathbf{X})$, with $\mathbf{X} = (X_1, \ldots, X_p)^T$. Yet, $\beta_j^*$ can be far away from zero. In other words, under the conditions listed above, $X_j$ is a hidden signature variable. To demonstrate that, let us consider the case in which $p = 2000$, with true regression coefficients $\boldsymbol{\beta}^* = (3, 3, 3, 3, 3, -7.5, 0, \ldots, 0)^T$, all variables follow the standard normal distribution with equal correlation 0.5, and $\varepsilon$ follows the standard normal distribution. By design, $X_6$ is a hidden signature variable, which is marginally uncorrelated with the response $Y$. Based on a random sample of size 100 from the model, we fit marginal regressions and obtain the marginal estimates $\{\hat\beta_j^M\}_{j=1}^{2000}$. The magnitudes of these estimates are summarized by their averages over three groups: indices 1 to 5 (denoted by $\hat\beta^M_{1:5}$), 6 and indices 7 to 2000. Clearly, the magnitude on the first group should be the largest, followed by the third group. Figure 1(a) depicts the distributions of those marginal magnitudes based on 10000 simulations. Clearly, variable $X_6$ cannot be selected by marginal screening.
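The population covariances behind this example can be checked directly from $\Sigma\boldsymbol{\beta}^*$. The following is a minimal sketch, assuming standardized covariates with equal correlation 0.5 as described above; the function name `marginal_covariances` is illustrative and not part of the paper.

```python
import numpy as np

def marginal_covariances(sigma, beta):
    """Cov(X_j, Y) for every j under Y = X^T beta + eps, i.e. the vector Sigma @ beta."""
    return sigma @ beta

# Setting of the example: p = 2000, equal correlation 0.5, unit variances.
p, r = 2000, 0.5
sigma = np.full((p, p), r)
np.fill_diagonal(sigma, 1.0)

beta = np.zeros(p)
beta[:5] = 3.0    # five visible active variables
beta[5] = -7.5    # X_6, the hidden signature variable

cov_xy = marginal_covariances(sigma, beta)
print("first five:", cov_xy[:5])
print("X_6:", cov_xy[5])
print("an inactive variable:", cov_xy[6])
```

At the population level, the marginal covariance of $X_6$ with $Y$ vanishes while every inactive variable keeps a nonzero covariance, which is why marginal ranking places $X_6$ last.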
Figure 1: Benefits of conditioning against false negatives. Upper left panel: the distributions of the averages of the magnitudes $|\hat\beta_j^M|$ of marginal regression coefficients over three groups of variables 1:5, 6, 7:2000. Upper right panel: the distributions of the averages of the magnitude $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients over two groups of variables: 6 and 7:2000. Lower left panel: the distributions of the magnitudes $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients when the conditioned set includes inactive variables. Lower right panel: the distributions of the averages of the magnitude $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients given five randomly chosen variables.

Adopting the conditional screening approach gives a very different result. Conditioning upon the first five variables, the conditional correlation between $X_6$ and $Y$ has a large magnitude. With the same simulated data as in the above example, the regression coefficient $\hat\beta^M_{\mathcal{C}j}$ of $X_j$ in the joint model with the first five variables is computed. This measures the conditional contribution of variable $X_j$ in presence of the first five variables. Again, the magnitudes $\{|\hat\beta^M_{\mathcal{C}j}|\}_{j=6}^{2000}$ are summarized into two values: $|\hat\beta^M_{\mathcal{C}6}|$ and the average of $\{|\hat\beta^M_{\mathcal{C}j}|\}_{j=7}^{2000}$. The distributions of those over 10000 simulations are also depicted in Figure 1(b). Clearly, the variable $X_6$ has a higher marginal contribution than the others. That is, conditioning helps recruit the hidden signature variable. Furthermore, conditioning is fairly robust to extra elements. To demonstrate that, we have repeated the previous experiment with conditioning on five more randomly chosen features. The distributions of the magnitudes are given in Figure 1(c). It is seen that the important hidden variable again has a large magnitude.

The benefits of conditioning are observed even if the conditioned variables are not in the active set. To demonstrate that, the regression coefficient $\hat\beta^M_{\mathcal{C}j}$ of $X_j$ has been computed while conditioning on five randomly chosen inactive variables. That is, the contribution of variable $X_j$ is calculated in the presence of these five randomly chosen inactive variables. The magnitudes of $\{|\hat\beta^M_{\mathcal{C}j}|\}$ are summarized in three groups: the average over the first five important variables, i.e. $\{|\hat\beta^M_{\mathcal{C}j}|\}_{j=1}^{5}$, the value $|\hat\beta^M_{\mathcal{C}6}|$, and the average of $\{|\hat\beta^M_{\mathcal{C}j}|\}_{j=7}^{2000}$. The distributions for these variables over 10000 simulations are given in Figure 1(d). It is observed that the magnitude of the hidden signature variable increases significantly and hence it will surely not be missed during the screening. In other words, conditioning can help to recruit the important variables, even when the conditional set is not ideally chosen.

Secondly, conditional screening helps reduce the number of false negatives. Marginal screening can fail when there are covariates in the non-active set that are highly correlated with active variables. To appreciate this, consider the linear model (1) again with sparse regression coefficients $\boldsymbol{\beta}^* = (10, 0, \ldots, 0, 1)^T$ and equi-correlation 0.9 among all covariates except $X_{2000}$, which is independent of the rest of the covariates. This setting gives

$$\mathrm{Cov}(X_1, Y) = 10, \quad \mathrm{Cov}(X_{2000}, Y) = 1, \quad \text{and} \quad \mathrm{Cov}(X_j, Y) = 9 \ \text{ for } j \neq 1, 2000.$$

In this case, marginal utilities for all nonactive variables are higher than that for the active variable $X_{2000}$. A summary similar to Figure 1 is shown in the upper left panel
Figure 2: Benefits of conditioning against false positives. Upper left panel: the distributions of the magnitude $|\hat\beta_j^M|$ of marginal regression coefficients over three groups of variables 1, 2:1999 and 2000. Upper right panel: the distributions of the magnitude $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients over two groups of variables: 2:1999 and 2000. Lower left panel: the distributions of the magnitudes $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients when five inactive variables are included in the conditioned set. Lower right panel: the distributions of the averages of the magnitude $|\hat\beta^M_{\mathcal{C}j}|$ of conditional marginal regression coefficients given ten randomly chosen variables.

of Figure 2. Therefore, based on SIS (sure independence screening) in Fan and Lv (2008), the active variable $X_{2000}$ has the least priority to be included. By using the conditional screening approach in which the covariate $X_1$ is conditioned upon (used in the joint fit), marginal utilities of the spurious variables are significantly reduced. The distributions of the average of the magnitude of the conditional fitted coefficients $\{|\hat\beta^M_{\mathcal{C}j}|\}_{j=2}^{1999}$ and of $|\hat\beta^M_{\mathcal{C}2000}|$ are shown in the upper right panel of Figure 2. Clearly, the nonactive variables are significantly demoted by conditioning. To observe the effects of conditioning on extra variables and on randomly chosen variables, a similar experiment to the first case is also done. Figure 2(c) depicts the distribution of the conditioned marginal fits when five extra variables are conditioned on. The contributions of variables $X_j$ in the presence of ten randomly chosen variables are given in Figure 2(d). It is seen that the relative magnitude of the hidden active variable $X_{2000}$ is considerably larger and hence it is more likely to be recruited during screening.
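A population-level sketch of this second example follows. It assumes that the conditional utility of $X_j$ given $X_1$ is measured by the covariance between the residuals of $X_j$ and of $Y$ after each is linearly regressed on $X_1$; the helper name `linear_conditional_cov` is hypothetical and is used only for illustration.

```python
import numpy as np

def linear_conditional_cov(sigma, beta, cond):
    """Residual covariance between each X_j (j not in cond) and Y after both are
    linearly regressed on X_cond, for the linear model Y = X^T beta + eps."""
    p = sigma.shape[0]
    rest = np.setdiff1d(np.arange(p), cond)
    cov_xy = sigma @ beta                    # marginal covariances Cov(X_j, Y)
    s_cc = sigma[np.ix_(cond, cond)]
    s_dc = sigma[np.ix_(rest, cond)]
    return rest, cov_xy[rest] - s_dc @ np.linalg.solve(s_cc, cov_xy[cond])

# Setting of the example: equi-correlation 0.9 except X_2000, which is independent.
p = 2000
sigma = np.full((p, p), 0.9)
sigma[-1, :] = 0.0
sigma[:, -1] = 0.0
np.fill_diagonal(sigma, 1.0)

beta = np.zeros(p)
beta[0], beta[-1] = 10.0, 1.0

marginal = sigma @ beta
rest, conditional = linear_conditional_cov(sigma, beta, cond=np.array([0]))
print("marginal: X_1, X_2, X_2000 ->", marginal[0], marginal[1], marginal[-1])
print("given X_1: X_2, X_2000 ->", conditional[0], conditional[-1])
```

Under these assumptions, the spurious variables lose all of their utility once $X_1$ is accounted for, while $X_{2000}$ keeps its covariance of 1 with the response.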



Finally, as shown by Fan and Lv (2008) and Fan and Song (2010), for a given threshold of marginal utility, the size of the selected variables depends on the correlation among covariates, as measured by the largest eigenvalue of $\Sigma$: $\lambda_{\max}(\Sigma)$. The larger the quantity, the more variables have to be selected in order to have a sure screening property. By using conditional screening, the relevant quantity now becomes $\lambda_{\max}(\Sigma_{\mathbf{X}_{\mathcal{D}}|\mathbf{X}_{\mathcal{C}}})$, where $\mathbf{X}_{\mathcal{C}}$ refers to the $q$ covariates that we will condition upon and $\mathbf{X}_{\mathcal{D}}$ is the rest of the variables. Conditioning helps reduce the correlation among the covariates $\mathbf{X}_{\mathcal{D}}$. This is particularly the case when the covariates $\mathbf{X}$ share some common factors, as in many biological (e.g. treatment effects) and financial studies (e.g. market risk factors). To illustrate the benefits, we consider the case where $\mathbf{X}$ is given by equally correlated normal random variables. Simple calculations yield that $\lambda_{\max}(\Sigma_{\mathbf{X}_{\mathcal{D}}}) = (1 - r) + rd$, where $r$ is the common correlation and $d = p - q$. As $\mathbf{X}$ has a normal distribution, the conditional covariance matrix can be calculated easily and it can be shown that

$$\lambda_{\max}(\Sigma_{\mathbf{X}_{\mathcal{D}}|\mathbf{X}_{\mathcal{C}}}) = (1 - r) + rd\,\frac{1 - r}{1 - r + rq}. \tag{2}$$

Note that when $q = 0$, the formula reduces to the unconditional one. It is clear that conditioning helps reduce the correlation among the variables. To quantify the degree of de-correlation, Figure 3 depicts the ratio $\lambda_{\max}(\Sigma_{\mathbf{X}_{\mathcal{D}}}) / \lambda_{\max}(\Sigma_{\mathbf{X}_{\mathcal{D}}|\mathbf{X}_{\mathcal{C}}})$ as a function of $r$ for various choices of $q$ when $d = 1000$. The reduction is dramatic, in particular when $r$ is large or $q$ is large. The benefits of conditioning are clearly evidenced.




Figure 3: Ratio of maximum eigenvalues of unconditioned and conditioned covariance 
matrix. 
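Formula (2) and the ratio plotted in Figure 3 can be checked numerically. The sketch below assumes equicorrelated normal covariates as in the text; the function names are illustrative, and the values of $r$ and $q$ in the loop are arbitrary choices rather than the ones used for the figure.

```python
import numpy as np

def lambda_max_formula(r, d, q):
    """Right-hand side of (2); q = 0 gives the unconditional value (1 - r) + r*d."""
    return (1 - r) + r * d * (1 - r) / (1 - r + r * q)

def lambda_max_numeric(r, d, q):
    """Largest eigenvalue of Var(X_D | X_C) for p = d + q equicorrelated normals."""
    p = d + q
    sigma = np.full((p, p), r)
    np.fill_diagonal(sigma, 1.0)
    s_dd = sigma[q:, q:]
    if q == 0:
        return np.linalg.eigvalsh(s_dd)[-1]
    s_cc, s_dc = sigma[:q, :q], sigma[q:, :q]
    # Gaussian conditional covariance: S_DD - S_DC S_CC^{-1} S_CD
    cond_cov = s_dd - s_dc @ np.linalg.solve(s_cc, s_dc.T)
    return np.linalg.eigvalsh(cond_cov)[-1]

def eigen_ratio(r, d, q):
    """Ratio of unconditional to conditional largest eigenvalue, as in Figure 3."""
    return lambda_max_formula(r, d, 0) / lambda_max_formula(r, d, q)

for q in (1, 5, 20):
    print(q, [round(eigen_ratio(r, 1000, q), 2) for r in (0.2, 0.5, 0.8)])
```

The closed form avoids building a $1000 \times 1000$ matrix for each point of the curve; the numeric version is there only to confirm that the two agree.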



In this paper, we propose the conditional screening technique and formally es- 
tablish the conditions under which it has a sure screening property. We also give 
an upper bound for the number of selected variables for each given threshold value. 
Two data-driven methods for choosing the thresholding parameter are proposed to 
facilitate the practical use of the conditional screening technique. 

The rest of the paper is organized as follows. In Section 2, we introduce the 
conditional sure independence screening procedure. The sure independence screening 
property and the uniform convergence of the conditional marginal maximum likeli- 
hood estimator are presented in Section 3. In Section 4, two approaches are proposed 
to choose the thresholding parameter for CSIS. Finally, we examine the performance 
of our procedure in Section 5 on simulated and real data. The details of the proofs 
are deferred to the Appendix. 
2 CONDITIONAL INDEPENDENCE SCREENING

2.1 Generalized Linear Models 

Generalized linear models assume that the conditional probability density of the random variable $Y$ given $\mathbf{X} = \mathbf{x} = (x_1, \ldots, x_p)^T$ belongs to an exponential family

$$f(y \,|\, \mathbf{x}; \theta) = \exp\big( y\,\theta(\mathbf{x}) - b(\theta(\mathbf{x})) + c(\mathbf{x}; y) \big), \tag{3}$$

where $b(\cdot)$ and $c(\cdot)$ are specific known functions in the canonical parameter $\theta(\mathbf{x})$. Note that we ignore the dispersion parameter $\phi$, since the interest only focuses on estimation of the mean regression function. However, it is easy to include a dispersion parameter $\phi$. Under model (3), we have the regression function

$$E(Y \,|\, \mathbf{X} = \mathbf{x}) = b'(\theta(\mathbf{x})).$$

The canonical parameter is further parameterized as

$$\theta(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\beta}^*,$$

namely the canonical link is used in modeling the mean regression function. Well known distributions in this exponential family include the normal, binomial, Poisson, and Gamma distributions.

In the ultrahigh dimensional sparse linear model, we assume that the true parameter $\boldsymbol{\beta}^* = (\beta_1^*, \ldots, \beta_p^*)^T$ is sparse. Namely, the set

$$\mathcal{M}_* = \{ j : \beta_j^* \neq 0 \}$$

is small. Our aim is to estimate the set $\mathcal{M}_*$ and the coefficient vector $\boldsymbol{\beta}^*$, as well as to predict the outcome $Y$. This is a more challenging task than just predicting $Y$ as in many machine learning problems. When the dimensionality is ultrahigh, one often employs a screening technique first to reduce the model size. It is particularly effective in distributed computation for dealing with "Big Data".

2.2 Conditional Screening 

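Conditional screening assumes that there is a set of variables $\mathbf{X}_{\mathcal{C}}$ that are known to be related to the response $Y$ and we wish to recruit additional variables from the rest of the variables, given by $\mathbf{X}_{\mathcal{D}}$, to better explain the response variable $Y$. For simplicity of notation, we assume without loss of generality that $\mathcal{C}$ is the set of the first $q$ variables and $\mathcal{D}$ is the remaining set of $d = p - q$ variables. We will use the notation

$$\boldsymbol{\beta}_{\mathcal{C}} = (\beta_1, \ldots, \beta_q)^T \in \mathbb{R}^q, \quad \text{and} \quad \boldsymbol{\beta}_{\mathcal{D}} = (\beta_{q+1}, \ldots, \beta_p)^T \in \mathbb{R}^d,$$

and similar notation for $\mathbf{X}_{\mathcal{C}}$ and $\mathbf{X}_{\mathcal{D}}$.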

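Assume without loss of generality that the covariates have been standardized so that

$$E(X_j) = 0 \quad \text{and} \quad E(X_j^2) = 1 \quad \text{for } j \in \mathcal{D}.$$

Given a random sample $\{(\mathbf{X}_i, Y_i)\}_{i=1}^n$ from the generalized linear model (3) with the canonical link, the conditional maximum marginal likelihood estimator $\hat{\boldsymbol{\beta}}^M_{\mathcal{C}j}$ for $j = q + 1, \ldots, p$ is defined as the minimizer of the (negative) marginal log-likelihood

$$\hat{\boldsymbol{\beta}}^M_{\mathcal{C}j} = \mathrm{argmin}_{\boldsymbol{\beta}_{\mathcal{C}}, \beta_j} \mathbb{P}_n \big\{ l(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}_{\mathcal{C}} + X_j \beta_j, Y) \big\}, \tag{4}$$

where $l(\theta, Y) = b(\theta) - \theta Y$ and $\mathbb{P}_n f(\mathbf{X}, Y) = n^{-1} \sum_{i=1}^n f(\mathbf{X}_i, Y_i)$ is the empirical measure. Denote from now on by $\hat\beta_j^M$ the last element of $\hat{\boldsymbol{\beta}}^M_{\mathcal{C}j}$. It measures the strength of the conditional contribution of $X_j$ given $\mathbf{X}_{\mathcal{C}}$. In the above notation, we assume that the intercept is used and is incorporated in the vector $\mathbf{X}_{\mathcal{C}}$. Conditional marginal screening based on the estimated marginal magnitude is to keep the variables

$$\widehat{\mathcal{M}}_{\mathcal{D},\gamma} = \{ j \in \mathcal{D} : |\hat\beta_j^M| \ge \gamma \}, \tag{5}$$

for a given thresholding parameter $\gamma$. Namely, we recruit variables with large additional contribution given $\mathbf{X}_{\mathcal{C}}$. This method will be referred to as conditional sure independence screening (CSIS). It depends, however, on the scale of $E_L(X_j | \mathbf{X}_{\mathcal{C}})$ and $E_L(Y | \mathbf{X}_{\mathcal{C}})$ to be defined in Section 3.1. A scale-free method is to use the likelihood reduction of the variable $X_j$ given $\mathbf{X}_{\mathcal{C}}$, which is equivalent to computing

$$\hat{R}_{\mathcal{C}j} = \min_{\boldsymbol{\beta}_{\mathcal{C}}, \beta_j} \mathbb{P}_n \big\{ l(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}_{\mathcal{C}} + X_j \beta_j, Y) \big\}, \tag{6}$$

after ignoring the common constant $\min_{\boldsymbol{\beta}_{\mathcal{C}}} \mathbb{P}_n \{ l(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}_{\mathcal{C}}, Y) \}$. The smaller $\hat{R}_{\mathcal{C}j}$, the more the variable $X_j$ contributes in presence of $\mathbf{X}_{\mathcal{C}}$. This leads to an alternative method based on the likelihood ratio statistics: recruit additional variables according to

$$\widehat{\mathcal{M}}_{\mathcal{D},\tilde\gamma} = \{ j \in \mathcal{D} : \hat{R}_{\mathcal{C}j} \le \tilde\gamma \}, \tag{7}$$

where $\tilde\gamma$ is a thresholding parameter. This method will be referred to as conditional maximum likelihood ratio screening (CMLR).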
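For the normal linear model, where $b(\theta) = \theta^2/2$, the minimization in (4) and (6) reduces to least squares, so both statistics can be sketched in a few lines. The code below is a minimal illustration under that assumption, run on a reduced-size version of the hidden-signature example from the introduction ($p = 500$ rather than 2000). The function name `csis_linear`, the seed, and the rule of keeping the ten largest coefficients are hypothetical choices and are not taken from the paper.

```python
import numpy as np

def csis_linear(x, y, cond):
    """Least-squares fits of y on (intercept, X_C, X_j) for every j outside cond.

    Returns the candidate indices, |beta_j| (last coefficient of each fit, as in (5))
    and the minimised empirical loss P_n{theta^2/2 - theta*Y} (as in (6)).
    """
    n, p = x.shape
    rest = np.setdiff1d(np.arange(p), cond)
    base = np.column_stack([np.ones(n), x[:, cond]])   # intercept is kept inside X_C
    abs_coef = np.empty(rest.size)
    loss = np.empty(rest.size)
    for k, j in enumerate(rest):
        design = np.column_stack([base, x[:, j]])
        b, *_ = np.linalg.lstsq(design, y, rcond=None)
        theta = design @ b
        abs_coef[k] = abs(b[-1])
        loss[k] = np.mean(theta ** 2 / 2 - theta * y)  # l(theta, Y) with b(theta) = theta^2/2
    return rest, abs_coef, loss

# Reduced-size hidden-signature example: equal correlation 0.5, n = 100.
rng = np.random.default_rng(0)
n, p, r = 100, 500, 0.5
common = rng.standard_normal((n, 1))
x = np.sqrt(r) * common + np.sqrt(1 - r) * rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5], beta[5] = 3.0, -7.5
y = x @ beta + rng.standard_normal(n)

rest, abs_coef, loss = csis_linear(x, y, cond=np.arange(5))
top_by_coef = rest[np.argmax(abs_coef)]   # ranking used by CSIS
top_by_loss = rest[np.argmin(loss)]       # ranking used by CMLR
gamma = np.sort(abs_coef)[-10]            # hypothetical threshold keeping ten variables
selected = rest[abs_coef >= gamma]
print(top_by_coef, top_by_loss, selected.size)
```

For other members of the exponential family, the least-squares call would be replaced by a fit of the corresponding generalized linear model; the ranking and thresholding steps stay the same.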
We emphasize that the set of variables $\mathbf{X}_{\mathcal{C}}$ does not necessarily have to contain active variables. Conditional screening only makes use of the fact that the effects of important variables are more visible in the presence of $\mathbf{X}_{\mathcal{C}}$ and the correlations of variables are weakened upon conditioning. This is commonly the case in many applications such as finance and biostatistics, where the variables share some common factors. It gives hidden signature variables a chance to survive. In fact, it was demonstrated in the introduction that conditioning can be beneficial even if the set $\mathbf{X}_{\mathcal{C}}$ is chosen randomly. Our theoretical study gives a formal justification of the iterated method proposed in Fan and Lv (2008) and Fan et al. (2009).

3 SURE SCREENING PROPERTIES 

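In order to prove the sure screening property of our method, we first need some properties on the population level. Let $\boldsymbol{\beta}_{\mathcal{C}j} = (\boldsymbol{\beta}_{\mathcal{C}}^T, \beta_j)^T$, $\mathbf{X}_{\mathcal{C}j} = (\mathbf{X}_{\mathcal{C}}^T, X_j)^T$, and

$$\boldsymbol{\beta}^M_{\mathcal{C}j} = \mathrm{argmin}_{\boldsymbol{\beta}_{\mathcal{C}}, \beta_j} E\, l(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}_{\mathcal{C}} + X_j \beta_j, Y), \tag{8}$$

with the expectation taken under the true model. Then, $\boldsymbol{\beta}^M_{\mathcal{C}j}$ is the population version of $\hat{\boldsymbol{\beta}}^M_{\mathcal{C}j}$. To establish the sure screening property, we need to show that the marginal regression coefficient $\beta_j^M$, the last component of $\boldsymbol{\beta}^M_{\mathcal{C}j}$, provides useful probes for the variables in the joint model $\mathcal{M}_*$, and that its sample version $\hat\beta_j^M$ is uniformly close to the population counterpart $\beta_j^M$. Therefore, the vector of marginal fitted regression coefficients $\{\hat\beta_j^M, j \in \mathcal{D}\}$ is useful for finding the variables in $\mathcal{M}_*$.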
3.1 Properties on Population Level

Since we are fitting $d$ marginal regressions, that is, we are using only $q + 1$ out of the $p$ original predictors, we need to introduce model misspecifications. Thus, we do not expect that the marginal regression coefficient $\beta_j^M$ is equal to the joint regression parameter $\beta_j^*$. However, we hope that when the joint regression coefficient $|\beta_j^*|$ exceeds a certain threshold, $|\beta_j^M|$ exceeds another threshold in most cases. Therefore, the marginal conditional regression coefficients provide useful probes for the joint regression.

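By (8), the marginal regression coefficients $\boldsymbol{\beta}^M_{\mathcal{C}j}$ satisfy the score equation

$$E\, b'(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}^M_{\mathcal{C}j}) \mathbf{X}_{\mathcal{C}j} = E\, Y \mathbf{X}_{\mathcal{C}j} = E\, b'(\mathbf{X}^T \boldsymbol{\beta}^*) \mathbf{X}_{\mathcal{C}j}, \tag{9}$$

where the second equality follows from the fact that $E(Y | \mathbf{X}) = b'(\mathbf{X}^T \boldsymbol{\beta}^*)$. Without using the additional variable $X_j$, the baseline parameter is given by

$$\boldsymbol{\beta}^M_{\mathcal{C}} = \mathrm{argmin}_{\boldsymbol{\beta}_{\mathcal{C}}} E\, l(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}_{\mathcal{C}}, Y), \tag{10}$$

and satisfies the equation

$$E\, b'(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}}) \mathbf{X}_{\mathcal{C}} = E\, Y \mathbf{X}_{\mathcal{C}} = E\, b'(\mathbf{X}^T \boldsymbol{\beta}^*) \mathbf{X}_{\mathcal{C}}. \tag{11}$$

We assume that the problems at the marginal level are fully identifiable, namely, the solutions $\boldsymbol{\beta}^M_{\mathcal{C}}$ and $\boldsymbol{\beta}^M_{\mathcal{C}j}$ are unique.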

To understand the conditional contribution, we introduce the concept of the conditional linear expectation. We use the notation

$$E_L(Y | \mathbf{X}_{\mathcal{C}}) = b'(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}}), \quad \text{and} \quad E_L(Y | \mathbf{X}_{\mathcal{C}j}) = b'(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}^M_{\mathcal{C}j}), \tag{12}$$

which is the best linearly fitted regression within the class of linear functions. Similarly, we use the notation $E_L(X_j | \mathbf{X}_{\mathcal{C}})$ to denote the best linear regression fit of $X_j$ by using $\mathbf{X}_{\mathcal{C}}$. Then, equation (11) can be more intuitively expressed as

$$E\big(Y - E_L(Y | \mathbf{X}_{\mathcal{C}})\big) \mathbf{X}_{\mathcal{C}} = 0. \tag{13}$$

Note that the conditioning in this paper is really a conditioning linear fit and the conditional expectation is really the conditional linear expectation. This facilitates the implementation of the conditional (linear) screening in high dimensions, but adds some technical challenges in the proof.

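Let us examine the implication of the marginal signal $\beta_j^M$. When $\beta_j^M = 0$, by (9), the first $q$ components of $\boldsymbol{\beta}^M_{\mathcal{C}j}$, denoted by $\boldsymbol{\beta}^M_{\mathcal{C}j1}$, should be equal to $\boldsymbol{\beta}^M_{\mathcal{C}}$ by uniqueness of the solution of equation (11). Then, equation (9) on the component $X_j$ entails

$$E\, b'(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}}) X_j = E\, Y X_j, \quad \text{or} \quad E\, X_j \big(Y - E_L(Y | \mathbf{X}_{\mathcal{C}})\big) = 0.$$

Using (13), the above condition can be more comprehensively expressed as

$$\mathrm{Cov}_L(Y, X_j | \mathbf{X}_{\mathcal{C}}) = E\big(X_j - E_L(X_j | \mathbf{X}_{\mathcal{C}})\big)\big(Y - E_L(Y | \mathbf{X}_{\mathcal{C}})\big) = 0. \tag{14}$$

This proves the necessary condition of the following theorem.

Theorem 1. For $j \in \mathcal{D}$, the marginal regression parameter $\beta_j^M = 0$ if and only if

$$\mathrm{Cov}_L(Y, X_j | \mathbf{X}_{\mathcal{C}}) = 0.$$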

The proof of the sufficient part is given in Appendix A.1. In order to have the sure screening property at the population level of equation (8), the important variables $\{X_j, j \in \mathcal{M}_{*\mathcal{D}}\}$ should be conditionally correlated with the response, where $\mathcal{M}_{*\mathcal{D}} = \mathcal{M}_* \cap \mathcal{D}$. Moreover, if $X_j$ (with $j \in \mathcal{M}_{*\mathcal{D}}$) is conditionally correlated with the response, the regression coefficient $\beta_j^M$ is non-vanishing. The sure screening property of conditional MLE (CMLE), given by equation (5), will be guaranteed if the minimum marginal signal strength is stronger than the estimation error. This will be shown in Theorem 2 and requires Condition 1. The details of the proof are relegated to Appendix A.2.

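Condition 1.

(i) For $j \in \mathcal{M}_{*\mathcal{D}}$, there exists a positive constant $c_1 > 0$ and $\kappa < 1/2$ such that

$$|\mathrm{Cov}_L(Y, X_j | \mathbf{X}_{\mathcal{C}})| \ge c_1 n^{-\kappa}.$$

(ii) Let $m_j$ be the random variable defined by

$$m_j = \frac{b'(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}^M_{\mathcal{C}j}) - b'(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}})}{\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}^M_{\mathcal{C}j} - \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}}}.$$

Then, $E\, m_j X_j^2 \le c_2$ uniformly in $j = q + 1, \ldots, p$.

Note that, by strict convexity of $b(\theta)$, $m_j > 0$ almost surely. When we are dealing with linear models, i.e. $b(\theta) = \theta^2/2$, then $m_j = 1$ and Condition 1(ii) requires that $E X_j^2$ is bounded uniformly, which is automatically satisfied by the normalization condition $E X_j^2 = 1$.

Theorem 2. If Condition 1 holds, then there exists a $c_3 > 0$ such that

$$\min_{j \in \mathcal{M}_{*\mathcal{D}}} |\beta_j^M| \ge c_3 n^{-\kappa}.$$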

3.2 Properties on Sample Level 



In this section, we prove the uniform convergence of the conditional marginal maximum likelihood estimator and the sure screening property of the conditional sure independence screening method. In addition, we provide an upper bound on the size of the set of selected variables $\widehat{\mathcal{M}}_{\mathcal{D},\gamma}$.

Since the log-likelihood of a generalized linear model with the canonical link is concave, $E(l(Y, \mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}_{\mathcal{C}j}))$ has a unique minimizer over $\boldsymbol{\beta}_{\mathcal{C}j} \in \mathcal{B}$ at an interior point $\boldsymbol{\beta}^M_{\mathcal{C}j}$, where $\mathcal{B}$, the set over which the marginal likelihood is maximized, consists of the coefficient vectors whose components are bounded in absolute value by a constant $B$. To obtain the uniform convergence result at the sample level, a few more conditions on the conditional marginal likelihood are needed.

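Condition 2.

(i) For the Fisher information $I_j(\boldsymbol{\beta}_{\mathcal{C}j}) = E\big(b''(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}_{\mathcal{C}j}) \mathbf{X}_{\mathcal{C}j} \mathbf{X}_{\mathcal{C}j}^T\big)$, its operator norm $\|I_j(\boldsymbol{\beta}_{\mathcal{C}j})\|_{\mathcal{B}}$ is bounded, where

$$\|I_j(\boldsymbol{\beta}_{\mathcal{C}j})\|_{\mathcal{B}} = \sup_{\boldsymbol{\beta}_{\mathcal{C}j} \in \mathcal{B},\ \|\mathbf{x}_{\mathcal{C}j}\| = 1} \|I_j(\boldsymbol{\beta}_{\mathcal{C}j})^{1/2} \mathbf{x}_{\mathcal{C}j}\|$$

and $\|\cdot\|$ is the Euclidean norm.

(ii) There exist some positive constants $r_0$, $r_1$, $s_0$, $s_1$ and $\alpha$ such that for sufficiently large $t$

$$P(|X_j| > t) \le r_1 \exp(-r_0 t^{\alpha}) \quad \text{for } j = 1, \ldots, p$$

and that

$$E\big(b(\mathbf{X}^T \boldsymbol{\beta}^* + s_0) - b(\mathbf{X}^T \boldsymbol{\beta}^*)\big) + E\big(b(\mathbf{X}^T \boldsymbol{\beta}^* - s_0) - b(\mathbf{X}^T \boldsymbol{\beta}^*)\big) \le s_1.$$

(iii) The second derivative of $b(\theta)$ is continuous and positive. There exists an $\varepsilon_1 > 0$ such that for all $j = q + 1, \ldots, p$:

$$\sup_{\boldsymbol{\beta}_{\mathcal{C}j} \in \mathcal{B},\ \|\boldsymbol{\beta}_{\mathcal{C}j} - \boldsymbol{\beta}^M_{\mathcal{C}j}\| \le \varepsilon_1} \big| E\, b(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}_{\mathcal{C}j}) I(|X_j| > K_n) \big| \le o(n^{-1}),$$

where $I(\cdot)$ is the indicator function and $K_n$ is an arbitrarily large constant such that for a given $\boldsymbol{\beta}$ in $\mathcal{B}$, the function $l(\mathbf{x}^T \boldsymbol{\beta}, y)$ is Lipschitz for all $(\mathbf{x}, y)$ in $\Lambda_n = \{\mathbf{x}, y : \|\mathbf{x}\|_{\infty} \le K_n, |y| \le K_n^*\}$ with $K_n^* = r_0 K_n^{\alpha} / s_0$.

(iv) For all $\boldsymbol{\beta}_{\mathcal{C}j} \in \mathcal{B}$, we have

$$E\big(l(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}_{\mathcal{C}j}, Y) - l(\mathbf{X}_{\mathcal{C}j}^T \boldsymbol{\beta}^M_{\mathcal{C}j}, Y)\big) \ge V \|\boldsymbol{\beta}_{\mathcal{C}j} - \boldsymbol{\beta}^M_{\mathcal{C}j}\|^2,$$

for some positive $V$, bounded from below uniformly over $j = q + 1, \ldots, p$.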

The first three conditions given in Condition 2 are satisfied for almost all of the commonly used generalized linear models. Examples include linear regression, logistic regression, and Poisson regression. The first part of Condition 2(ii) puts an exponential bound on the tails of $X_j$.

In the following theorem, the uniform convergence of our conditional marginal maximum likelihood estimator is stated, as well as the sure screening property of the procedure. The proof of this theorem is deferred to Appendix A.3.

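Theorem 3. Suppose that Condition 2 holds. Let $k_n = b'(K_n B (q + 1)) + r_0 K_n^{\alpha} / s_0$, with $K_n$ given in Condition 2.

(i) If $n^{1-2\kappa} k_n^{-2} K_n^{-2} \to \infty$, then for any $c_3 > 0$, there exists a positive constant $c_4$ such that

$$P\Big( \max_{q+1 \le j \le p} |\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa} \Big) \le d \exp\big( -c_4 n^{1-2\kappa} (k_n K_n)^{-2} \big) + d n r_2 \exp\big( -r_0 K_n^{\alpha} \big),$$

where $r_2 = q r_1 + s_1$.

(ii) If in addition, Condition 1 holds, then by taking $\gamma = c_5 n^{-\kappa}$ with $c_5 \le c_3/2$, we have

$$P\big( \mathcal{M}_{*\mathcal{D}} \subset \widehat{\mathcal{M}}_{\mathcal{D},\gamma} \big) \ge 1 - s \exp\big( -c_4 n^{1-2\kappa} (k_n K_n)^{-2} \big) - n r_2 s \exp\big( -r_0 K_n^{\alpha} \big),$$

for some constant $c_5$, where $s = |\mathcal{M}_{*\mathcal{D}}|$ is the size of the set of nonsparse elements.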

Note that the sure screening property, stated in the second conclusion of Theorem 3, depends only on the size $s$ of the set of nonsparse elements and not on the dimensionality $d$ or $p$. This result is understandable since we only need the elements in $\mathcal{M}_{*\mathcal{D}}$ to pass the threshold, and this only requires the uniform convergence of $\hat\beta_j^M$ over $j \in \mathcal{M}_{*\mathcal{D}}$.

The truncation parameter $K_n$ appears in both terms of the upper bound of the probability. There is a trade-off on this choice. For the Bernoulli model with logistic link, $b'(\cdot)$ is bounded and the optimal order for $K_n$ is $n^{(1-2\kappa)/(\alpha+2)}$. In this case, the conditional sure independence screening method can handle the dimensionality

$$\log d = o\big( n^{(1-2\kappa)\alpha/(\alpha+2)} \big),$$

which guarantees that the upper bound in Theorem 3 converges to zero. A similar result for unconditional screening is shown in Fan and Song (2010). In particular, when the covariates are bounded, we can take $\alpha = \infty$, and when covariates are normal, we have that $\alpha = 2$. For the normal linear model, following the same argument as Fan and Song (2010), the optimal choice is $K_n = n^{(1-2\kappa)/A}$ where $A = \max\{\alpha + 4, 3\alpha + 2\}$. Then, conditional sure independence screening can handle dimensionality

$$\log d = o\big( n^{(1-2\kappa)\alpha/A} \big),$$

which is of order $o(n^{(1-2\kappa)/4})$ when $\alpha = 2$.
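The exponents in these rates are simple functions of $\kappa$ and $\alpha$. As a small worked example, the sketch below evaluates them for the two cases discussed above; the function names and the sample value $\kappa = 0.25$ are illustrative choices only.

```python
import math

def logistic_exponents(kappa, alpha):
    """Exponents of n for K_n and for the admissible log d, logistic (bounded b') case."""
    kn_exp = (1 - 2 * kappa) / (alpha + 2)
    return kn_exp, alpha * kn_exp

def linear_exponents(kappa, alpha):
    """Same two exponents for the normal linear model, with A = max(alpha + 4, 3*alpha + 2)."""
    a = max(alpha + 4, 3 * alpha + 2)
    kn_exp = (1 - 2 * kappa) / a
    return kn_exp, alpha * kn_exp

kappa = 0.25  # any value below 1/2
for alpha in (1, 2, 4):
    print(alpha, logistic_exponents(kappa, alpha), linear_exponents(kappa, alpha))

# Normal covariates (alpha = 2): the linear-model rate equals (1 - 2*kappa)/4.
assert math.isclose(linear_exponents(kappa, 2)[1], (1 - 2 * kappa) / 4)
```

With $\alpha = 2$ the constant $A$ equals 8, which is how the exponent $(1 - 2\kappa)/4$ in the last display arises.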
We have just stated the sure screening property of our CSIS method, that is, $\widehat{\mathcal{M}}_{\mathcal{D},\gamma} \supset \mathcal{M}_{*\mathcal{D}}$. However, a good screening method not only possesses the sure screening property, but also retains a small set of variables after thresholding. Below, we give a bound on the size of the selected set of variables, under the following additional conditions.

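Condition 3.

(i) The variance $\mathrm{Var}(\mathbf{X}^T \boldsymbol{\beta}^*) = \boldsymbol{\beta}^{*T} \Sigma \boldsymbol{\beta}^*$ and $b''(\cdot)$ are bounded.

(ii) The minimum eigenvalue of the matrix $E[m_j \mathbf{X}_{\mathcal{C}j} \mathbf{X}_{\mathcal{C}j}^T]$ is larger than a positive constant, uniformly over $j$, where $m_j$ is defined in Condition 1(ii).

(iii) Letting

$$Z = E\big\{ E_L[\mathbf{X}_{\mathcal{D}} | \mathbf{X}_{\mathcal{C}}] \, [\mathbf{X}^T \boldsymbol{\beta}^* - \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}}] \big\},$$

it holds that $\|Z\|_2^2 = o\{\lambda_{\max}(\Sigma_{\mathcal{D}|\mathcal{C}})\}$, with $\lambda_{\max}(\Sigma_{\mathcal{D}|\mathcal{C}})$ the largest eigenvalue of $\Sigma_{\mathcal{D}|\mathcal{C}} = E[\mathbf{X}_{\mathcal{D}} - E_L(\mathbf{X}_{\mathcal{D}} | \mathbf{X}_{\mathcal{C}})][\mathbf{X}_{\mathcal{D}} - E_L(\mathbf{X}_{\mathcal{D}} | \mathbf{X}_{\mathcal{C}})]^T$.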

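As noted above, for the normal linear model, $b(\theta) = \theta^2/2$. Condition 3(ii) requires that the minimum eigenvalue of $E\, \mathbf{X}_{\mathcal{C}j} \mathbf{X}_{\mathcal{C}j}^T$ be bounded away from zero. In general, by strict convexity of $b(\theta)$, $m_j > 0$ almost surely. Thus, Condition 3(ii) is mild.

For the linear model with $b'(\theta) = \theta$, by (11),

$$E\, \mathbf{X}_{\mathcal{C}} \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}} = E\, \mathbf{X}_{\mathcal{C}} \mathbf{X}^T \boldsymbol{\beta}^*$$

and hence $Z = 0$ since $E_L[\mathbf{X}_{\mathcal{D}} | \mathbf{X}_{\mathcal{C}}]$ is linear in $\mathbf{X}_{\mathcal{C}}$ by definition. Thus, Condition 3(iii) holds automatically.

From the proof of Theorem 4, without Condition 3(iii), Theorem 4 below continues to hold with $\Sigma_{\mathcal{D}|\mathcal{C}}$ replaced by $\Sigma_{\mathcal{D}|\mathcal{C}} + Z Z^T$.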
Theorem 4. Under Conditions 2 and 3, we have for $\gamma = c_5 n^{-\kappa}$ that there exists a $c_4 > 0$ such that

$$P\big( |\widehat{\mathcal{M}}_{\mathcal{D},\gamma}| \le O(n^{2\kappa} \lambda_{\max}(\Sigma_{\mathcal{D}|\mathcal{C}})) \big) \ge 1 - d \big( \exp( -c_4 n^{1-2\kappa} (k_n K_n)^{-2} ) + n r_2 \exp( -r_0 K_n^{\alpha} ) \big).$$

This theorem is proved in Appendix A.4.

4 SELECTION OF THE THRESHOLDING PARAMETER

In the previous section, we have shown that CSIS has the sure screening property when the thresholding level $\gamma$ is chosen such that $\gamma \propto n^{-\kappa}$. Unfortunately, in practice $\gamma$, which relates to the minimum strength of marginal signals in the data, is always unknown. Therefore, $\gamma$ has to be estimated from the data itself. Underestimating $\gamma$ will leave a lot of variables after screening, which leads to a large number of false positives, and similarly, overestimating $\gamma$ will prevent sure screening.

In this section, we present two procedures that select a thresholding level for CSIS. The first approach is based on controlling the number of false positives by bounding the false discovery rate (FDR). This method uses the fact that quasi-likelihood estimates for GLMs enjoy asymptotic normality. The second approach, which we call random decoupling, uses a resampling technique to create the null model and to measure the maximum strength of noise. In random decoupling, we use marginal regression on the null model to obtain marginal regression coefficients that are known to be zero. We use the maximum of these marginal coefficients of the null model as a thresholding level.



4.1 Controlling FDR 



It is well known that quasi-maximum likelihood estimates have an asymptotically
normal distribution (Heyde 1997; Gao et al. 2008). For covariates $j$ such that
$\beta_j^M = 0$, asymptotically it follows that

$$I_j^{1/2}\,\hat\beta_j^M \sim N(0, 1),$$

where $I_j$ denotes the element that corresponds to $\beta_j$ in the information matrix.

Using this property, we can build a thresholding technique that bounds the proportion
of elements $j$ such that $\beta_j^M = 0$. For the case when $\beta_j^M = 0$ for all
$j \in (\mathcal{M}_{\star D})^c$, this rate is also called the false discovery rate in
Zhao and Li (2012) and is given by

$$E\left(\frac{|\widehat{\mathcal{M}}_{D,\delta} \cap (\mathcal{M}_{\star D})^c|}{|(\mathcal{M}_{\star D})^c|}\right).$$

By choosing

$$\widehat{\mathcal{M}}_{D,\delta} = \{\, j : I_j^{1/2}\,|\hat\beta_j^M| \ge \delta \,\},$$

the expected false discovery rate is bounded above by $2(1 - \Phi(\delta))$, where $\Phi(\cdot)$ is the
distribution function of a standard normal random variable. This approach can also be seen
as a modification of the method introduced by Zhao and Li (2012) for the Cox model. By
setting $\delta$ to $\Phi^{-1}(1 - f/(2d))$, where $f$ is the maximum number of false positives we can
tolerate, we obtain an expected false positive rate that is less than $f/(d - |\mathcal{M}_{\star D}|)$, as the
following theorem shows. The proof of this theorem is given in Appendix A.5.
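The thresholding rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the estimates $\hat\beta_j^M$ and the matching information entries $I_j$ have already been computed elsewhere, and the names `fdr_threshold`, `select_by_fdr` and the dictionary inputs are hypothetical.

```python
from statistics import NormalDist


def fdr_threshold(f, d):
    """Return delta = Phi^{-1}(1 - f / (2 d)) for f tolerated false positives."""
    if not 0 < f < 2 * d:
        raise ValueError("need 0 < f < 2 * d")
    return NormalDist().inv_cdf(1.0 - f / (2.0 * d))


def select_by_fdr(beta_hat, info, f):
    """Keep index j when I_j^{1/2} * |beta_hat_j| >= delta.

    beta_hat and info are hypothetical dicts mapping a candidate index j to
    its conditional marginal estimate and to the information entry I_j.
    """
    delta = fdr_threshold(f, len(beta_hat))
    selected = sorted(
        j for j, b in beta_hat.items() if (info[j] ** 0.5) * abs(b) >= delta
    )
    return selected, delta


# Toy numbers, made up for illustration only.
beta_hat = {5: 0.5, 6: 0.01, 7: -0.3}
info = {5: 100.0, 6: 100.0, 7: 100.0}
selected, delta = select_by_fdr(beta_hat, info, f=1)
print(selected, round(delta, 3))
```

With this choice of $\delta$, the bound $2(1 - \Phi(\delta))$ equals $f/d$ by construction, which is how the tolerated number of false positives enters the threshold.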

Condition 4.

1. For any $j$, let $e_{ij} = Y_i - b'(X_{i,C,j}^T \beta_{C,j}^M)$ for $i = 1, \ldots, n$. For a given $j$,
$\mathrm{Var}(e_{ij}) \ge c_6$ for some positive $c_6$ and $i = 1, \ldots, n$, and
$\sup_{i \ge 1} E|e_{ij}|^{2+\chi} < \infty$ for some $\chi > 0$.

2. For $j \in (\mathcal{M}_{\star D})^c$, we have that $\mathrm{Cov}_L(Y, X_j \mid X_C) = 0$.

Theorem 5. Under Conditions 1, 2 and 4, if we choose

$$\widehat{\mathcal{M}}_{D,\delta} = \{\, j : I_j^{1/2}\,|\hat\beta_j^M| \ge \delta \,\},$$

where $\delta = \Phi^{-1}(1 - f/(2d))$ and $f$ is the number of false positives that can be tolerated,
then, for some constant $c_7 > 0$ it holds that



4.2 Random Decoupling 

Random decoupling is another procedure to select the thresholding parameter $\gamma$.
It is used to create a null model, in which the data are formed by randomly permuting
the rows of the last $d$ columns of the design matrix, while keeping the first $q$ columns
of the design matrix intact. It is easy to see that by regressing $Y$ on $X_{C,j}$, where the
rows of the design matrix corresponding to $X_j$ ($j \notin C$) have been randomly permuted,
the obtained marginal value, denoted here by $\hat\beta_j^{*}$, is a statistical estimate of zero.
These marginal estimates based on decoupled data measure the noise level of the estimates
under the null model. Let $\gamma^* = \max_{q+1 \le j \le p} |\hat\beta_j^{*}|$. If $\gamma^*$ is used as the
thresholding value, all variables will be screened out based on the permuted data, which
leads to no false positives in this case. In other words, it is the minimum thresholding
parameter that makes no false positives. However, this $\gamma^*$ depends on the realization of the
permutation. To stabilize the thresholding value, one can repeat this exercise $K$
times (e.g. 5 or 10 times), resulting in the values

$$\{\gamma_k^*\}_{k=1}^{K}, \quad \text{where } \gamma_k^* = \max_{q+1 \le j \le p} |\hat\beta_j^{*(k)}|. \qquad (15)$$

Now, one can choose the maximum of $\{\gamma_k^*\}_{k=1}^{K}$, denoted by $\gamma^*_{\max}$, as a thresholding
value. A more stable choice is the $r$-quantile of the values in (15), denoted by $\gamma_r^*$.
A useful range for $r$ is [0.95, 1]. Note that for $r = 1$, $\gamma_r^* = \gamma^*_{\max}$. The selected variables
are then

$$\widehat{\mathcal{M}}_{D,r} = \{\, j : |\hat\beta_j^M| > \gamma_r^* \,\}.$$

In our numerical implementations, we perform the decoupling five times, i.e. $K = 5$, and take
$r = 0.99$. A similar idea for unconditional SIS already appears in Fan et al. (2011)
for additive models.
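The random decoupling procedure can be sketched as follows for the linear model. This is an illustrative sketch under stated assumptions, not the authors' code: the conditional marginal coefficient is computed by ordinary least squares through residualization on the conditioning columns (the Frisch-Waugh-Lovell identity), the quantile is a simple empirical order statistic, and all function names and the toy data are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual(v, basis):
    """Part of v orthogonal to every vector of an orthonormal basis."""
    r = list(v)
    for q in basis:
        c = dot(q, r)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r


def orthonormal_basis(columns):
    """Gram-Schmidt basis for the span of the conditioning columns."""
    basis = []
    for col in columns:
        r = residual(col, basis)
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:
            basis.append([ri / norm for ri in r])
    return basis


def conditional_coefs(y, cond_cols, cand_cols):
    """Least-squares coefficient of each candidate given the conditioning set."""
    basis = orthonormal_basis(cond_cols)
    ry = residual(y, basis)
    coefs = []
    for x in cand_cols:
        rx = residual(x, basis)
        coefs.append(dot(rx, ry) / dot(rx, rx))
    return coefs


def decoupling_threshold(y, cond_cols, cand_cols, K=5, r=0.99, seed=0):
    """r-quantile of the K null maxima obtained by permuting candidate rows."""
    rng = random.Random(seed)
    n = len(y)
    maxima = []
    for _ in range(K):
        perm = list(range(n))
        rng.shuffle(perm)  # same row permutation for all candidate columns
        permuted = [[x[i] for i in perm] for x in cand_cols]
        null_coefs = conditional_coefs(y, cond_cols, permuted)
        maxima.append(max(abs(c) for c in null_coefs))
    maxima.sort()
    index = min(K - 1, max(0, math.ceil(r * K) - 1))
    return maxima[index]


# Toy data (made up): one conditioning variable, 20 candidates, candidate 0 active.
rng = random.Random(1)
n = 200
ones = [1.0] * n
x_c = [rng.gauss(0, 1) for _ in range(n)]
cands = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(20)]
y = [2 * c + 3 * x + rng.gauss(0, 1) for c, x in zip(x_c, cands[0])]

gamma = decoupling_threshold(y, [ones, x_c], cands)
coefs = conditional_coefs(y, [ones, x_c], cands)
selected = [j for j, b in enumerate(coefs) if abs(b) > gamma]
```

For a GLM other than the linear model, `conditional_coefs` would be replaced by the corresponding conditional marginal likelihood fit; the permutation and quantile steps stay the same.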



5 NUMERICAL STUDIES 



In this section, we demonstrate the performance of CSIS on simulated data and 
two empirical datasets. We compare CSIS versus sure independence screening and 
penalized least squares methods in a variety of settings. 



5.1 Simulation Study 



In the simulation study, we compare the performance of the proposed CSIS with Lasso
(Tibshirani 1996) and unconditional SIS (Fan and Song 2010), in terms of variable
screening. We vary the sample size from 100 to 500 for different scenarios and the
number of predictors ranges from p = 2,000 to 40,000. We present results for both
linear regression and logistic regression.



We evaluate different screening methods on 200 simulated data sets based on the
following criteria:

1. MMMS: median minimum model size of the selected models that are required to
have a sure screening. The sampling variability of the minimum model size (MMS)
is measured by the robust standard deviation (RSD), which is defined as the
associated interquartile range of MMS divided by 1.34 across 200 simulations.

2. FP: average number of false positives across the 200 simulations.

3. FN: average number of false negatives across the 200 simulations.
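The criteria above can be computed as in the following minimal Python sketch. It is only an illustration: the function names and the toy scores are hypothetical, and the quartiles use the default rule of `statistics.quantiles`, which may differ slightly from the quartile convention used for the reported tables.

```python
from statistics import median, quantiles


def minimum_model_size(scores, active):
    """Smallest k such that the top-k variables by |score| contain all active ones."""
    order = sorted(scores, key=lambda j: abs(scores[j]), reverse=True)
    rank = {j: pos + 1 for pos, j in enumerate(order)}
    return max(rank[j] for j in active)


def robust_sd(values):
    """Interquartile range divided by 1.34."""
    q1, _, q3 = quantiles(values, n=4)
    return (q3 - q1) / 1.34


def false_pos_neg(selected, active):
    """Counts of selected inactive variables and of missed active variables."""
    selected, active = set(selected), set(active)
    return len(selected - active), len(active - selected)


# Toy example with made-up marginal scores for four candidate variables.
scores = {1: 0.9, 2: -0.1, 3: 0.5, 4: 0.05}
active = {1, 2}
mms = minimum_model_size(scores, active)
fp, fn = false_pos_neg(selected=[1, 3], active=active)

# Across simulations, MMMS is the median of the per-run MMS values.
mms_runs = [1, 2, 3, 4, 5, 6, 7]
mmms, rsd = median(mms_runs), robust_sd(mms_runs)
```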

We consider two different methods for selecting thresholding parameters: controlling
FDR and random decoupling as outlined in the previous section, and we present
false negatives and false positives for each method. The average numbers of false positives
and false negatives are denoted by FP_π and FN_π for the random decoupling method
and FP_FDR and FN_FDR for the FDR method. For the FDR method, we have chosen
the number of tolerated false positives as n/log n. For the experiments with p =
5,000 and p = 40,000, we do not report the corresponding results for Lasso, since
it is not proposed for variable screening, and the data-driven choice of regularization
parameter for model selection is not necessarily optimal for variable screening.

5.1.1 Normal model 

The first two simulated examples concern linear models introduced in the introduction,
regarding the false positives and false negatives of unconditional SIS. We report
the simulation results in Table 1, in which the column labeled "Example 1" refers
to the first setting and the column labeled "Example 2" refers to the second setting.
These examples are designed to make unconditional SIS fail. Not surprisingly, SIS
performs poorly in sure screening the variables, and conditional SIS easily resolves
the problem. Also, we note that CSIS needs only one additional variable to have
sure screening, whereas Lasso needs 15 additional variables. Both the FDR and the
random decoupling methods return no false negatives under almost all of the simulations.
In other words, both of the data-driven thresholding methods ensured the
sure screening property. However, they tend to be conservative, as the numbers of
the false positives are high. The FDR approach has a relatively small number of false
positives when used for conditional sure independence screening. For these settings, the
FDR method was found to be less conservative than the random decoupling method.

Table 1: The MMMS, its RSD (in parentheses), the "false negative" and "false positive"
for the linear model with n = 100 and p = 2,000.

Example 1
                    SIS           MLR           CSIS        CMLR       Lasso
MMMS (RSD)          1995 (0)      1995 (0)      1 (0)       1 (0)      16 (0)
FP_π, FN_π          1531, 0.07    1859, 1.00    175, 0      112, 0     -
FP_FDR, FN_FDR      1934, 0.07    -             164, 0      -          -

Example 2
                    SIS           MLR           CSIS        CMLR       Lasso
MMMS (RSD)          1999 (0)      1999 (0)      1 (0)       1 (0)      16 (0)
FP_π, FN_π          1998, 0.01    1998, 0.04    543.1, 0    174, 0     -
FP_FDR, FN_FDR      1998, 0.01    -             15.66, 0    -          -

In the next two settings, we work with higher dimensions, p = 5,000 and p = 40,000.
Following Fan and Song (2010), we generate the covariates from

$$X_j = \frac{\varepsilon_j + a_j \varepsilon}{\sqrt{1 + a_j^2}}, \qquad (16)$$

where $\varepsilon$ and $\{\varepsilon_j\}_{j=1}^{p/3}$ are i.i.d. standard normal random variables,
$\{\varepsilon_j\}_{j=p/3+1}^{2p/3}$ are i.i.d. double exponential variables with location parameter zero
and scale parameter one, and $\{\varepsilon_j\}_{j=2p/3+1}^{p}$ are i.i.d. and follow a mixture normal
distribution with two components $N(-1, 1)$, $N(1, 0.5)$ and equal mixture proportion. The covariates
are standardized to have mean zero and variance one. Specifically, we consider the
following two settings.

Example 3. In this setting, p = 5,000 and s = 12. The constants $a_1, \ldots, a_{100}$
are the same and chosen such that the correlation $\rho = \mathrm{Corr}(X_i, X_j) = 0, 0.2, 0.4, 0.6$
and 0.8 among the first 100 variables, and $a_{101} = \cdots = a_{5,000} = 0$.

Example 4. In this setting, p = 40,000 and s = 6. The constants $a_1, \ldots, a_{50}$ are
generated from the normal random distribution with mean $a$ and variance 1 and
$a_{51} = \cdots = a_{40,000} = 0$. The constant $a$ is taken such that $E(\mathrm{Corr}(X_i, X_j)) = 0, 0.2, 0.4, 0.6$
and 0.8 among the first 50 variables.

In both of the settings $\beta^*$ is generated from an alternating sequence of 1 and 1.3.
For conditional sure independence screening, we condition on the first 2 covariates if
s = 6 and we condition on the first 4 covariates if s = 12. Results are presented in
Tables 2 and 3.
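The covariate construction in (16) can be sketched in Python as follows. This is a hedged illustration rather than the simulation code behind the tables: the function names are hypothetical, the second parameter of $N(1, 0.5)$ is treated as a variance (an assumption), and the final standardization step is omitted. For two standard normal noise terms sharing the same loading $a$, the correlation is $a^2/(1 + a^2)$, so $a = \sqrt{\rho/(1-\rho)}$ gives correlation $\rho$.

```python
import math
import random


def loading_for_correlation(rho):
    """Common loading a with a^2 / (1 + a^2) = rho (normal block, 0 <= rho < 1)."""
    return math.sqrt(rho / (1.0 - rho))


def draw_noise(j, p, rng):
    """Noise term for the 1-based index j, following the three blocks of (16)."""
    if j <= p // 3:
        return rng.gauss(0.0, 1.0)
    if j <= 2 * p // 3:
        # Double exponential: exponential magnitude with a random sign.
        magnitude = rng.expovariate(1.0)
        return magnitude if rng.random() < 0.5 else -magnitude
    # Equal mixture of N(-1, 1) and N(1, 0.5); 0.5 is assumed to be a variance.
    if rng.random() < 0.5:
        return rng.gauss(-1.0, 1.0)
    return rng.gauss(1.0, math.sqrt(0.5))


def generate_row(a, rng):
    """One observation X_j = (eps_j + a_j * eps) / sqrt(1 + a_j^2), j = 1..p."""
    p = len(a)
    shared = rng.gauss(0.0, 1.0)
    return [
        (draw_noise(j + 1, p, rng) + a[j] * shared) / math.sqrt(1.0 + a[j] ** 2)
        for j in range(p)
    ]


def sample_corr(u, v):
    """Plain Pearson correlation, used only to check the toy example."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Toy check with p = 6: the first two (normal) covariates share the loading.
rng = random.Random(0)
a = [loading_for_correlation(0.8)] * 2 + [0.0] * 4
rows = [generate_row(a, rng) for _ in range(4000)]
rho_hat = sample_corr([r[0] for r in rows], [r[1] for r in rows])
```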

As expected, CSIS needs a smaller model size to have all the relevant variables, i.e.
to possess the sure screening property. The effect is more pronounced for higher p and
when more of the variables are correlated. A surprising result is that the advantage
of conditioning is smaller when the correlation levels are higher. This is probably because
only 50 or 100 of the covariates are correlated, hence conditioning
cannot fully utilize its advantages. We also see that both methods for choosing the
thresholding parameter are very effective. Both the FDR and random decoupling
methods tend to have the sure screening property (no false negatives) and a low number
of false positives.

Table 2: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 3 with p = 5,000 and s = 4 + 8.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   300   86 (150)       0.21     4.61    20.75     1.23
  0.20   100   43 (19)        34.17    0.82    87.70     0.03
  0.40   100   56 (20)        87.38    0.00    101.75    0.00
  0.60   100   58 (24)        88.20    0.00    101.68    0.00
  0.80   100   63 (19)        88.17    0.00    101.64    0.00

Conditional Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   300   57 (92)        0.16     3.74    21.09     0.97
  0.20   100   31 (38)        2.74     2.97    29.93     0.69
  0.40   100   29 (21)        17.65    0.99    48.03     0.42
  0.60   100   32 (18)        44.93    0.23    55.60     0.29
  0.80   100   42 (20)        67.55    0.06    50.01     0.66

Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   300   86 (141)       0.77     0.23
  0.20   100   43 (20)        47.88    0.03
  0.40   100   52 (19)        88.48    0.00
  0.60   100   58 (18)        88.78    0.00
  0.80   100   60 (19)        88.75    0.00

Conditional Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   300   18 (25)        0.72     1.65
  0.20   100   23 (24)        5.71     1.44
  0.40   100   23 (17)        16.45    0.76
  0.60   100   28 (19)        23.81    0.55
  0.80   100   33 (22)        26.09    0.69

Table 3: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 4 with p = 40,000 and s = 2 + 4.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   200   1133 (8246)    11.46    1.35    40.70     0.89
  0.20   200   37 (1079)      30.37    0.61    57.83     0.46
  0.40   200   37 (12)        37.92    0.32    62.71     0.24
  0.60   200   37 (11)        41.35    0.17    65.61     0.13
  0.80   200   36 (12)        43.73    0.02    66.89     0.02

Conditional Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   200   13 (84)        5.83     0.57    31.04     0.43
  0.20   200   16 (18)        16.62    0.31    41.07     0.23
  0.40   200   16 (12)        23.89    0.11    45.61     0.08
  0.60   200   17 (10)        29.83    0.03    50.05     0.01
  0.80   200   17 (10)        37.41    0.00    54.34     0.02

Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   200   1133 (8246)    13.61    0.19
  0.20   200   41 (1503)      31.62    0.11
  0.40   200   37 (12)        39.24    0.06
  0.60   200   37 (11)        42.51    0.05
  0.80   200   36 (12)        44.45    0.00

Conditional Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   200   14 (261)       5.42     0.07
  0.20   200   10 (21)        13.02    0.05
  0.40   200   7 (10)         18.04    0.02
  0.60   200   6 (5)          21.66    0.01
  0.80   200   6 (3)          25.00    0.00

5.1.2 Binomial model

In this section data are given by i.i.d. copies of $(X^T, Y)$, where the conditional
distribution of $Y$ given $X = x$ is a binomial distribution with probability of success
$P(x) = \exp(x^T\beta^*)\,(1 + \exp(x^T\beta^*))^{-1}$. The first two settings use the same setup of
covariates and the same values for $\beta^*$ as that in Example 1. The results are given in
Table 4.
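The success probability above is the logistic function of the linear predictor. A minimal Python sketch of drawing one response is given below; it assumes a single trial per observation (a Bernoulli response), and the function names are hypothetical.

```python
import math
import random


def success_probability(x, beta):
    """P(x) = exp(x'beta) / (1 + exp(x'beta)), evaluated without overflow."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def draw_response(x, beta, rng):
    """Draw Y in {0, 1} with success probability P(x)."""
    return 1 if rng.random() < success_probability(x, beta) else 0


# Toy coefficients following the alternating 1 and 1.3 pattern.
beta = [1.0, 1.3]
p_zero = success_probability([0.0, 0.0], beta)
y = draw_response([0.5, -0.2], beta, random.Random(0))
```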

The results are almost the same as in the normal model. Conditional screening
always lists the active variable as the most important one and Lasso only needs 16
variables. We also see that FDR and random decoupling methods are still successful,
even though the setting is nonlinear.

The final settings for the binomial model use the same construction for the covariates
as those in Examples 3 and 4. We again work with s = 6 and s = 12. For
settings 2 and 3, $\beta^*$ is again given by a sequence of 1s and 1.3s. Results are given in
Tables 5 and 6.

The results are the same as for the normal model. Due to the nonlinear nature of
the problem, the minimum model size is slightly higher and the thresholding methods
are less efficient. However, even though the covariates are not too correlated, the overall
advantage of conditional sure independence screening can easily be observed.

Table 4: The MMMS, its RSD (in parentheses) for the binomial model with the "false
negative" and "false positive" settings for n = 100 and p = 2,000.

Example 1
                    SIS           MLR           CSIS        CMLR           Lasso
MMMS (RSD)          1995 (1.5)    1995 (1.5)    1 (0)       1 (0)          16 (0)
FP_π, FN_π          726, 0.07     1282, 1.00    35.72, 0    31.11, 0.01    -
FP_FDR, FN_FDR      1344, 0.07    -             34.05, 0    -              -

Example 2
                    SIS           MLR           CSIS        CMLR           Lasso
MMMS (RSD)          1999 (0)      1999 (0)      1 (0)       1 (0)          16 (0)
FP_π, FN_π          1998, 0.03    1998, 0.14    462, 0      157, 0.01      -
FP_FDR, FN_FDR      1998, 0.04    -             5.65, 0     -              -

Table 5: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 3 with the binomial model with p = 5,000 and s = 4 + 8.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π      FN_π    FP_FDR    FN_FDR
  0.00   300   215 (312)      0.19      5.78    23.06     1.77
  0.20   300   27 (14)        73.22     0.02    109.56    0.00
  0.40   300   49 (21)        88.19     0.00    110.15    0.00
  0.60   300   56 (20)        88.17     0.00    110.00    0.00
  0.80   300   68 (19)        88.20     0.00    110.34    0.00

Conditional Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π      FN_π    FP_FDR    FN_FDR
  0.00   300   87 (173)       20.15     1.24    24.03     1.11
  0.20   300   19 (13)        49.25     0.14    53.87     0.11
  0.40   300   34 (23)        67.82     0.17    61.72     0.31
  0.60   300   43 (24)        77.36     0.21    53.83     1.01
  0.80   300   66 (55)        78.33     0.51    36.16     3.42

Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π      FN_π
  0.00   300   210 (312)      20.18     0.08
  0.20   300   28 (17)        107.08    0.00
  0.40   300   47 (24)        107.82    0.00
  0.60   300   60 (22)        107.47    0.00
  0.80   300   67 (19)        107.30    0.00

Conditional Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π      FN_π
  0.00   300   83 (173)       20.18     1.21
  0.20   300   20 (14)        45.27     0.20
  0.40   300   39 (30)        53.48     0.49
  0.60   300   71 (87)        49.47     1.15
  0.80   300   402 (561)      35.42     3.43

Table 6: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 4 with the binomial model with p = 40,000 and s = 2 + 4.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   500   318 (7038)     12.04    1.22    51.32     0.79
  0.20   500   38 (428)       32.47    0.57    68.46     0.38
  0.40   500   38 (12)        38.66    0.27    73.42     0.19
  0.60   500   38 (12)        41.99    0.16    76.11     0.10
  0.80   500   35 (12)        43.84    0.03    77.38     0.02

Conditional Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π     FN_π    FP_FDR    FN_FDR
  0.00   500   13 (354)       5.96     0.66    42.51     0.49
  0.20   500   15 (16)        14.51    0.39    49.79     0.27
  0.40   500   16 (13)        19.11    0.24    51.68     0.22
  0.60   500   19 (10)        22.80    0.21    51.78     0.24
  0.80   500   19 (10)        26.39    0.14    46.49     0.64

Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   500   309 (7030)     14.06    0.22
  0.20   500   37 (255)       34.10    0.09
  0.40   500   35.5 (11)      40.50    0.05
  0.60   500   35.5 (12)      42.89    0.03
  0.80   500   33.5 (14)      44.39    0.00

Conditional Maximum Likelihood Ratio
  ρ      n     MMMS (RSD)     FP_π     FN_π
  0.00   500   25 (892)       5.96     0.14
  0.20   500   13 (62)        12.38    0.09
  0.40   500   13 (22)        14.17    0.08
  0.60   500   15.5 (17)      13.75    0.11
  0.80   500   22 (72)        9.30     0.28

5.1.3 Robustness of CSIS



In this section, we evaluate the performance of CSIS under three different conditioning
sets: the set consists of (i) only active variables, (ii) both active and inactive variables
and (iii) only (randomly chosen) inactive variables. We consider a different correlation
structure where the number of correlated variables is significantly large.

For this experiment, Example 5, we set p = 10,000 and s = 6. We generate
covariates from equation (16) and choose the constants $a_1, \ldots, a_{2000}$ such that the
correlation $\rho = \mathrm{Corr}(X_i, X_j) = 0, 0.2, 0.4, 0.6$ and 0.8 among the first 2000 variables
and $a_{2001} = \cdots = a_{10,000} = 0$. We fix $\beta^* = (1, 2, 1, 2, 0, \ldots, 0, 1, 2)^T$.

The following three conditioning sets are considered: (i) $C_1 = \{1, 2\}$; (ii) $C_2 =
\{1, 2, 5, 2001\}$ and (iii) $C_3$ = {random choice of 4 inactive variables}. More precisely,
$C_3$ consists of 3 randomly chosen variables from the first two thousand variables
which are correlated and 1 randomly chosen inactive variable from the rest. Note
that variables 1 and 2 are active variables whereas variables 5 and 2001 are inactive.
We have simulation results using both the conditional MLE (5) and conditional MLR
(6). To save space, we only present the results using the conditional MLE for the
normal model in Table 7 and for the binomial model in Table 8.

The results show clearly that the benefits of conditional screening are significant
even when the conditioning variables are wrongly chosen. CSIS reduces the minimum model
size at least by half, and for most of the cases it uses about one tenth as many variables as
unconditional screening. CSIS performs well even if some of the conditioned variables
are inactive or even all are randomly selected inactive variables. For the worst cases,
"mis-conditioning" forced CSIS to recruit twice as many variables, and for most of
the cases, the difference is not excessive. In all cases, CSIS performs significantly
better than unconditional screening.

Table 7: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 5 for the Linear Model with p = 10,000 and s = 2 + 4.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   200   35 (80)        98.20      0.28    20.16      0.63
  0.20   200   1601 (812)     1854.75    0.34    1537.35    0.51
  0.40   200   2038 (267)     2083.30    0.45    2010.73    0.63
  0.60   200   2108 (470)     2088.11    0.52    2010.59    0.73
  0.80   200   2193 (663)     2092.08    0.58    2010.59    0.83

CSIS with C_1
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   200   6 (8)          98.17      0.07    23.51      4.00
  0.20   200   13 (47)        440.33     0.04    143.85     3.90
  0.40   200   75 (215)       1001.84    0.03    336.05     3.67
  0.60   200   216 (358)      1372.48    0.01    379.81     3.64
  0.80   200   423 (429)      1518.04    0.00    234.19     3.79

CSIS with C_2
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   200   6 (7)          98.29      0.08    23.44      4.00
  0.20   200   21 (75)        565.76     0.03    212.80     3.75
  0.40   200   152 (413)      1367.95    0.03    642.06     3.33
  0.60   200   443 (676)      1766.88    0.01    830.50     3.12
  0.80   200   868 (643)      1860.01    0.00    594.86     3.40

CSIS with C_3
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   200   44 (90)        100.33     0.30    23.23      2.31
  0.20   200   481 (687)      1022.85    0.24    499.31     1.50
  0.40   200   1322 (752)     1806.40    0.20    1147.03    0.86
  0.60   200   1652 (462)     2003.43    0.10    1345.32    0.63
  0.80   200   1716 (297)     2037.08    0.03    1103.83    0.94

Table 8: The MMMS, its RSD (in parentheses), the "false positive" and "false negative"
for Example 5 for the Binomial Model with p = 10,000 and s = 2 + 4.

Sure Independence Screening
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   400   24 (59)        97.39      0.21    27.29      0.48
  0.20   400   1606 (776)     1933.60    0.20    1725.60    0.39
  0.40   400   2029 (101)     2082.82    0.30    2016.35    0.52
  0.60   400   2070 (258)     2087.22    0.45    2015.59    0.64
  0.80   400   2096 (429)     2090.86    0.51    2015.07    0.66

CSIS with C_1
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   400   8 (16)         98.20      0.10    31.98      4.00
  0.20   400   22 (75)        361.04     0.10    138.73     3.85
  0.40   400   107 (223)      743.80     0.08    247.20     3.74
  0.60   400   289 (439)      1022.71    0.10    246.67     3.75
  0.80   400   637 (528)      1142.79    0.16    133.97     3.82

CSIS with C_2
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   400   7 (17)         98.33      0.11    31.31      4.00
  0.20   400   27 (114)       460.60     0.11    196.27     3.83
  0.40   400   176 (429)      1045.28    0.08    456.86     3.52
  0.60   400   578 (759)      1394.61    0.10    508.52     3.55
  0.80   400   910 (673)      1480.91    0.10    291.69     3.71

CSIS with C_3
  ρ      n     MMMS (RSD)     FP_π       FN_π    FP_FDR     FN_FDR
  0.00   400   309 (919)      100.00     0.89    14.83      2.69
  0.20   400   777 (1129)     529.20     0.66    149.64     2.12
  0.40   400   1285 (1075)    1087.79    0.56    333.27     1.96
  0.60   400   1572 (977)     1383.80    0.58    336.54     2.06
  0.80   400   1629 (892)     1485.02    0.57    178.37     2.79

5.2 Leukemia Data



In this section, we demonstrate how CSIS can be used to do variable selection with
an empirical dataset. We consider the leukemia dataset which was first studied by
Golub et al. (1999) and is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.
The data come from a study of gene expression in two types of acute leukemias, acute
lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression
levels were measured using Affymetrix oligonucleotide arrays containing 7129 genes
and 72 samples coming from two classes, namely 47 in class ALL and 25 in class AML.
Among these 72 samples, 38 (27 ALL and 11 AML) are set to be training samples
and 34 (20 ALL and 14 AML) are set as test samples. For this dataset we want to
select the relevant genes, and based on the selected genes estimate whether the patient
has ALL or AML. AML progresses very fast and has a poor prognosis. Therefore, a
consistent classification method that relies on gene expression levels would be very
beneficial for the diagnosis.

In order to choose the conditioning genes, we take a pair of genes described in
Golub et al. (1999) that result in low test errors. The first is Zyxin and the second
is Transcriptional activator hSNF2b. Both genes have empirically high correlations
for the difference between people with AML and ALL.

After conditioning on the aforementioned genes, we implement our conditional
selection procedure using logistic regression. Using the random decoupling method,
we select a single gene, TCRD (T-cell receptor delta locus). Although this gene has
not been discovered by the ALL/AML studies so far, it is known to have a relation
with T-Cell ALL, a subgroup of ALL (Szczepański et al. 2003). By using only these
three genes, we are able to obtain a training error of 0 out of 38, and a test error
of 1 out of 34. Similar studies in the past using sparse linear discriminant analysis
or nearest shrunken centroids methods have obtained test errors of 1 by using more
than 10 variables. We conjecture that this is due to the high correlation between the
Zyxin gene and others, and that this correlation masks the information contained in
the TCRD gene.



5.3 Financial Data 

In this section we illustrate the advantages of conditional sure independence screening
on a factor model with financial data. From the website
http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ we obtain 30 portfolios formed
with respect to their industries. The returns for each portfolio are denoted by $Y_j$
(for j = 1, ..., 30). The Fama-French three-factor model suggests that these returns
follow the equation

$$Y_j = \beta_{j1} f_1 + \beta_{j2} f_2 + \beta_{j3} f_3 + \varepsilon_j, \qquad (17)$$

where $f_1$ is the excess return of the proxy market portfolio (given by the difference
of the one-month T-Bill yield and the value weighted return of all stocks on NYSE,
AMEX and NASDAQ), $f_2$ is the difference between the return of small and big companies
(measured by the difference of returns of two portfolios, one with companies
that have small market cap and one with companies with large market cap) and finally
$f_3$ is the difference of return from value companies and growth companies. This
model was first proposed by Fama and French (1993) and has been extensively analyzed
since then. Since this seminal work, many other factors have been considered.
In our numerical example, we used screening with the permutation test to detect if
other factors are necessary. Besides the three factors mentioned above, we consider
the momentum factor as an additional factor. This gives us 4 factors that are conditioned
upon in CSIS. For each given industrial portfolio, we also consider the returns
from the other 29 portfolios as potential prediction factors.

We use daily returns data from 1/3/2002 to 12/31/2007. For each portfolio (30
in total), we first consider the marginal screening without conditioning. On average,
for each portfolio, marginal screening picks 25.3 among 29 other industrial portfolios
as predictors. This is mainly due to correlations between the returns of different
portfolios. We next consider conditional marginal screening, in which the three
Fama-French factors and the momentum factor are conditioned upon. As expected, the
number of the selected variables decreases significantly to an average of 4.8. That
is, about 4.8 portfolios on average can still have some potential prediction power in
the presence of the aforementioned four major factors. The marginal and conditional fits
of the values are given in Figure 4. The black parts indicate the variables which are
not included.

It is seen from these results that conditional screening is more advantageous
compared to marginal screening if a few of the factors are known to be important.
Furthermore, when there is significant correlation between some of the factors, as shown
in the introduction, marginal screening considers most of the factors as relevant. In
almost all financial models, stock returns are correlated with the return of the market
portfolio. Therefore, in variable selection for financial factor models with many variables,
one should always consider the returns conditional on the main driving forces
of the market.

Figure 4: Chosen factors with marginal (left) and conditional screening (right).
Panel (a): $\hat\beta$ using marginal screening. Panel (b): $\hat\beta$ using conditional screening.

APPENDIX

A.1 Proof of Theorem 1

Proof of Theorem 1. The necessary part has already been proven in Section 3.1. To
prove the sufficient condition, we first note that condition $\mathrm{Cov}_L(Y, X_j \mid X_C) = 0$ is
equivalent to

$$E\, b'(X_C^T\beta_C^M)\, X_j = E\, Y X_j,$$

as shown in Section 3.1. This and (11) imply that $((\beta_C^M)^T, 0)^T$ is a solution to equation
(2). By the uniqueness, it follows that $\beta_{C,j}^M = ((\beta_C^M)^T, 0)^T$, namely $\beta_j^M = 0$. This
completes the proof.

□



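A.2 Proof of Theorem 2

Proof of Theorem 2. We denote the matrix $E m_j X_{C,j} X_{C,j}^T$ as $\Omega_j$ and partition it as

$$\Omega_j =
\begin{pmatrix} E m_j X_C X_C^T & E m_j X_C X_j \\ E m_j X_j X_C^T & E m_j X_j^2 \end{pmatrix}
=
\begin{pmatrix} \Omega_{C,C} & \Omega_{C,j} \\ \Omega_{C,j}^T & \Omega_{j,j} \end{pmatrix}.$$

From the score equations, i.e. equations (2) and (11), we have that

$$E\, b'(X_C^T\beta_C^M)\, X_C = E\, b'(X_{C,j}^T\beta_{C,j}^M)\, X_C.$$

Using the definition of $m_j$, the above equation can be written as

$$E\, m_j (X_{C,j}^T\beta_{C,j}^M - X_C^T\beta_C^M)\, X_C = 0.$$

By letting $\beta_{\Delta,j} = \beta_{C,j,1}^M - \beta_C^M$, we have that

$$E\, m_j (X_C^T\beta_{\Delta,j} + X_j\beta_j^M)\, X_C = 0,$$

or equivalently

$$\beta_{\Delta,j} = -\Omega_{C,C}^{-1}\Omega_{C,j}\,\beta_j^M. \qquad (A.1)$$

Furthermore, by (13), we can express $\mathrm{Cov}_L(Y, X_j \mid X_C)$ as

$$\mathrm{Cov}_L(Y, X_j \mid X_C) = E X_j \{Y - E_L(Y \mid X_C)\}. \qquad (A.2)$$

It follows from the score equations that

$$\mathrm{Cov}_L(Y, X_j \mid X_C) = E X_j \{b'(X_{C,j}^T\beta_{C,j}^M) - b'(X_C^T\beta_C^M)\}. \qquad (A.3)$$

Using the definition of $m_j$ again, we have

$$\mathrm{Cov}_L(Y, X_j \mid X_C) = E\, m_j X_j (X_{C,j}^T\beta_{C,j}^M - X_C^T\beta_C^M)
= E\, m_j X_j (X_C^T\beta_{\Delta,j} + X_j\beta_j^M).$$

By (A.1), we conclude that

$$\mathrm{Cov}_L(Y, X_j \mid X_C) = (\Omega_{j,j} - \Omega_{C,j}^T\Omega_{C,C}^{-1}\Omega_{C,j})\,\beta_j^M. \qquad (A.4)$$

Now it is easy to see by Condition 1 that

$$|\beta_j^M| \ge c_2^{-1}\,|\mathrm{Cov}_L(Y, X_j \mid X_C)| \ge c_3 n^{-\kappa},$$

where $c_3 = c_1/c_2$. Taking the minimum over all $j \in \mathcal{M}_{\star D}$ gives the result.

□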



A.3 Proof of Theorem [3] 



The proof of Theorem [3] uses an e xponential bound for a quasi maximum likelihood 



Fan and SongI (120101 ) and we repeat their theorem 



estimator. This bound is shown in 
here to facilitate the reading. 

Let (3q = argmin^E/(X^/3,y) the population parameter, which is an interior 
point of a large compact and convex set B C MP. 

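Condition 5.

1. The Fisher information

$$I(\boldsymbol{\beta}) = \mathrm{E}\left[ \left\{\frac{\partial}{\partial \boldsymbol{\beta}}\, l(\mathbf{X}^T \boldsymbol{\beta}, Y)\right\} \left\{\frac{\partial}{\partial \boldsymbol{\beta}}\, l(\mathbf{X}^T \boldsymbol{\beta}, Y)\right\}^T \right]$$

is finite and positive definite at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$. Furthermore, $\sup_{\boldsymbol{\beta} \in \mathbf{B},\, \|\mathbf{x}\| = 1} \|I(\boldsymbol{\beta})^{1/2} \mathbf{x}\|$ exists.

2. The function $l(\mathbf{x}^T \boldsymbol{\beta}, y)$ is Lipschitz with a positive constant $k_n$ for any $\boldsymbol{\beta}$ in $\mathbf{B}$ and $(\mathbf{x}, y)$ in $\Lambda_n = \{\mathbf{x}, y : \|\mathbf{x}\|_\infty \le K_n, |y| \le K_n^*\}$, with $K_n$ and $K_n^*$ arbitrarily large constants. Furthermore, there exists a constant $C$ such that

$$\sup_{\boldsymbol{\beta} \in \mathbf{B},\ \|\boldsymbol{\beta} - \boldsymbol{\beta}_0\| \le C k_n V_n^{-1} (p/n)^{1/2}} \left| \mathrm{E}\left[ l(\mathbf{X}^T \boldsymbol{\beta}, Y) - l(\mathbf{X}^T \boldsymbol{\beta}_0, Y) \right] (1 - I_n(\mathbf{X}, Y)) \right| \le o(p/n), \tag{A.5}$$

where $I_n(\mathbf{x}, y) = I((\mathbf{x}, y) \in \Lambda_n)$, with the constant $V_n$ defined below.

3. The function $l(\mathbf{x}^T \boldsymbol{\beta}, y)$ is convex in $\boldsymbol{\beta}$ and

$$\left| \mathrm{E}\left[ l(\mathbf{X}^T \boldsymbol{\beta}, Y) - l(\mathbf{X}^T \boldsymbol{\beta}_0, Y) \right] \right| \ge V_n \|\boldsymbol{\beta} - \boldsymbol{\beta}_0\|^2$$

for some positive constants $V_n$ and all $\|\boldsymbol{\beta} - \boldsymbol{\beta}_0\| \le C k_n V_n^{-1} (p/n)^{1/2}$.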



Theorem 6. (Fan and Song, 2010) Under Condition 5, for any $t > 0$ it holds that

$$P\left( \sqrt{n}\, \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| \ge 16 k_n (1 + t)/V_n \right) \le \exp(-2t^2/K_n^2) + n P(\Lambda_n^c).$$

The proof of Theorem 3 is based on Theorem 6.



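Proof of Theorem 3. By Lemma 1 of Fan and Song (2010), Condition 2(ii) gives the bound

$$P(|Y| \ge u) \le s_1 \exp(-s_0 u).$$

Hence, we have

$$P(\Lambda_n^c) \le P(\|\mathbf{X}\|_\infty > K_n) + P(|Y| > K_n^*) \le r_2 \exp(-r_0 K_n^{\alpha}).$$

Using this and Theorem 6, letting $1 + t = c_5 V_n n^{1/2 - \kappa}/(16 k_n)$, we have

$$P\left( |\hat{\beta}^M_j - \beta^M_j| \ge c_5 n^{-\kappa} \right) \le \exp\left( -c_4 n^{1 - 2\kappa}/(k_n K_n)^2 \right) + n r_2 \exp(-r_0 K_n^{\alpha})$$

for some positive constant $c_4$. Then, by Bonferroni's inequality, we obtain

$$P\left( \max_{q+1 \le j \le p} |\hat{\beta}^M_j - \beta^M_j| \ge c_5 n^{-\kappa} \right) \le d \left( \exp\left( -c_4 n^{1 - 2\kappa}/(k_n K_n)^2 \right) + n r_2 \exp(-r_0 K_n^{\alpha}) \right).$$

This proves the first conclusion.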



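The second statement can be shown by considering the event

$$A_n = \left\{ \max_{j \in \mathcal{M}_{*\mathcal{D}}} |\hat{\beta}^M_j - \beta^M_j| \le c_3 n^{-\kappa}/2 \right\}.$$

On the event $A_n$, by Theorem 2, it holds that for all $j \in \mathcal{M}_{*\mathcal{D}}$,

$$|\hat{\beta}^M_j| \ge c_3 n^{-\kappa}/2.$$

By letting $\gamma = c_5 n^{-\kappa} \le c_3 n^{-\kappa}/2$, on the event $A_n$ we have the sure screening property, that is, $\mathcal{M}_{*\mathcal{D}} \subset \hat{\mathcal{M}}_{\mathcal{D},\gamma}$. The probability bound can be shown by using the first result along with Bonferroni's inequality over all chosen $j$, which gives

$$P(A_n^c) \le s \left[ \exp\left( -c_4 n^{1 - 2\kappa}/(k_n K_n)^2 \right) + n r_2 \exp(-r_0 K_n^{\alpha}) \right].$$

This completes the proof.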



□ 
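
The screening rule analysed in Theorem 3 keeps the variables $j$ outside the conditioning set whose conditional marginal estimate satisfies $|\hat{\beta}^M_j| \ge \gamma$. A minimal sketch of that rule follows, assuming the linear model so that least squares stands in for a general GLM fit. The helper name `conditional_screen`, the simulated data and the value of `gamma` are hypothetical choices made only for illustration.

```python
import numpy as np

def conditional_screen(X, y, cond_idx, gamma):
    """Return the indices j not in cond_idx with |beta_hat_j| >= gamma.

    beta_hat_j is the coefficient of X_j in the least-squares fit of y on
    (1, X_C, X_j), i.e. the conditional marginal estimate in the linear case.
    """
    n, p = X.shape
    X_C = np.column_stack([np.ones(n), X[:, cond_idx]])
    selected = []
    for j in range(p):
        if j in cond_idx:
            continue
        design = np.column_stack([X_C, X[:, j]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        if abs(coef[-1]) >= gamma:
            selected.append(j)
    return selected

# Simulated example: variables 0 and 1 are known in advance, variable 5 is
# the only other active one.
rng = np.random.default_rng(1)
n, p = 2000, 20
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 5] + rng.normal(size=n)

selected = conditional_screen(X, y, cond_idx=[0, 1], gamma=0.5)
print(selected)
```

In this toy design the active coefficient is far above the threshold and the noise in the inactive estimates is far below it, which is the separation that the event $A_n$ formalises.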



A.4 Proof of Theorem 4



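Proof of Theorem 4. The first part of the proof is similar to that of Theorem 5 of Fan and Song (2010). The idea of this proof is to show that

$$\|\boldsymbol{\beta}^M_{\mathcal{D}}\|^2 = O(\lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}})). \tag{A.6}$$

If this holds, the size of the set $\{j = q+1, \ldots, p : |\beta^M_j| > \varepsilon n^{-\kappa}\}$ cannot exceed $O(n^{2\kappa} \lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}))$ for any $\varepsilon > 0$. Thus on the event

$$B_n = \left\{ \max_{q+1 \le j \le p} |\hat{\beta}^M_j - \beta^M_j| \le \varepsilon n^{-\kappa} \right\},$$

the set $\{j = q+1, \ldots, p : |\hat{\beta}^M_j| > 2\varepsilon n^{-\kappa}\}$ is a subset of the set $\{j = q+1, \ldots, p : |\beta^M_j| > \varepsilon n^{-\kappa}\}$, whose size is bounded by $O(n^{2\kappa} \lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}))$. If we take $\varepsilon = c_5/2$, we obtain that

$$P\left( |\hat{\mathcal{M}}_{\mathcal{D},\gamma}| \le O(n^{2\kappa} \lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}})) \right) \ge P(B_n).$$

Finally, by Theorem 3, we obtain that

$$P(B_n) \ge 1 - d \left( \exp\left( -c_4 n^{1 - 2\kappa}/(k_n K_n)^2 \right) + n r_2 \exp(-r_0 K_n^{\alpha}) \right),$$

and therefore the statement of the theorem follows.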

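We now prove (A.6) by using $\mathrm{Var}(\mathbf{X}^T \boldsymbol{\beta}^*) = O(1)$ and (A.4). By Condition 3(ii), the Schur complement $(\Omega_{j,j} - \boldsymbol{\Omega}_{\mathcal{C},j}^T \boldsymbol{\Omega}_{\mathcal{C},\mathcal{C}}^{-1} \boldsymbol{\Omega}_{\mathcal{C},j})$ is uniformly bounded from below. Therefore, by (A.4), we have

$$|\beta^M_j| \le D_1\, |\mathrm{Cov}_L(Y, X_j \mid \mathbf{X}_{\mathcal{C}})|,$$

for a positive constant $D_1$. Hence, we need only to bound the conditional covariance. By (A.2), (9) and Lipschitz continuity of $b'(\cdot)$, we have

$$\begin{aligned}
|\mathrm{Cov}_L(Y, X_j \mid \mathbf{X}_{\mathcal{C}})| &\le \mathrm{E}\, |X_j \{b'(\mathbf{X}^T \boldsymbol{\beta}^*) - b'(\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}})\}| \\
&\le D_2\, \mathrm{E}\, |X_j (\mathbf{X}^T \boldsymbol{\beta}^* - \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}})| \\
&= D_2\, \mathrm{E}\, |X_j [\mathbf{X}_{\mathcal{D}}^T \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^{\Delta}_{\mathcal{C}}]|,
\end{aligned}$$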
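where $\boldsymbol{\beta}^{\Delta}_{\mathcal{C}} = \boldsymbol{\beta}^*_{\mathcal{C}} - \boldsymbol{\beta}^M_{\mathcal{C}}$. Writing the last term in the vector form, we need to bound

$$\mathrm{E}\, \mathbf{X}_{\mathcal{D}} [\mathbf{X}_{\mathcal{D}}^T \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^{\Delta}_{\mathcal{C}}].$$

From the property of the least-squares, we have $\mathrm{E}[\mathrm{E}_L(\mathbf{X}_{\mathcal{D}} \mid \mathbf{X}_{\mathcal{C}}) \mathbf{X}_{\mathcal{C}}^T] = \mathrm{E}[\mathbf{X}_{\mathcal{D}} \mathbf{X}_{\mathcal{C}}^T]$. Thus the above expression can be written as

$$\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathrm{E}\, \mathrm{E}_L(\mathbf{X}_{\mathcal{D}} \mid \mathbf{X}_{\mathcal{C}}) [\mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^{\Delta}_{\mathcal{C}} + \mathrm{E}_L(\mathbf{X}_{\mathcal{D}}^T \mid \mathbf{X}_{\mathcal{C}}) \boldsymbol{\beta}^*_{\mathcal{D}}] = \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{Z},$$

recalling the definition of $\mathbf{Z} = \mathrm{E}\, \mathrm{E}_L(\mathbf{X}_{\mathcal{D}} \mid \mathbf{X}_{\mathcal{C}}) (\mathbf{X}^T \boldsymbol{\beta}^* - \mathbf{X}_{\mathcal{C}}^T \boldsymbol{\beta}^M_{\mathcal{C}})$ in Condition 3.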
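Using the law of total variance, we have that

$$\begin{aligned}
\|\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{Z}\|^2 &= \boldsymbol{\beta}^{*T}_{\mathcal{D}} \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}^2 \boldsymbol{\beta}^*_{\mathcal{D}} + 2 \mathbf{Z}^T \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{Z}^T \mathbf{Z} \\
&\le \lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}) (\boldsymbol{\beta}^{*T}_{\mathcal{D}} \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}}) + 2 \mathbf{Z}^T \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{Z}^T \mathbf{Z} \\
&\le \lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}})\, \mathrm{Var}(\mathbf{X}^T \boldsymbol{\beta}^*) + 2 \mathbf{Z}^T \boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}\, \boldsymbol{\beta}^*_{\mathcal{D}} + \mathbf{Z}^T \mathbf{Z},
\end{aligned}$$

and the last two terms are $o(\lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}}))$ due to Condition 3. Therefore, we have that

$$\|\boldsymbol{\beta}^M_{\mathcal{D}}\|^2 = O(\lambda_{\max}(\boldsymbol{\Sigma}_{\mathcal{D}|\mathcal{C}})),$$

and that gives us the desired result.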



□ 
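
The counting step in this proof is a Markov-type argument: if the squared norm of a coefficient vector is at most some bound, then at most that bound divided by $(\varepsilon n^{-\kappa})^2$ of its entries can exceed $\varepsilon n^{-\kappa}$ in absolute value. The sketch below illustrates this, together with the set inclusion used on the event $B_n$. The vector `beta`, the perturbation and all constants are made-up illustrative values, not quantities from the paper.

```python
import numpy as np

def count_and_bound(beta, threshold):
    """Number of entries with |beta_j| > threshold, and the bound
    ||beta||^2 / threshold^2 that this number can never exceed."""
    count = int(np.sum(np.abs(beta) > threshold))
    bound = float(np.sum(beta ** 2)) / threshold ** 2
    return count, bound

rng = np.random.default_rng(2)
n, kappa, eps = 200, 0.25, 1.0
threshold = eps * n ** (-kappa)

# Hypothetical population coefficients and estimates within `threshold` of them,
# which mimics being on the event B_n.
beta = rng.normal(scale=0.3, size=1000)
beta_hat = beta + rng.uniform(-threshold, threshold, size=beta.size)

count, bound = count_and_bound(beta, threshold)

# On B_n: {j : |beta_hat_j| > 2 * threshold} is a subset of {j : |beta_j| > threshold}.
big_hat = set(np.flatnonzero(np.abs(beta_hat) > 2 * threshold))
big_pop = set(np.flatnonzero(np.abs(beta) > threshold))
print(count <= bound, big_hat <= big_pop)
```

Both printed checks hold for any input satisfying the stated perturbation bound, since they follow from the triangle inequality and from summing squares over the large entries.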



A.5 Proof of Theorem 5



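Proof of Theorem 5. Note that the false discovery proportion can be rewritten as

$$\mathrm{E}\left( \frac{|\hat{\mathcal{M}}_{\mathcal{D},\delta} \cap (\mathcal{M}_{*\mathcal{D}})^c|}{|(\mathcal{M}_{*\mathcal{D}})^c|} \right) = \frac{1}{d - |\mathcal{M}_{*\mathcal{D}}|} \sum_{j \in (\mathcal{M}_{*\mathcal{D}})^c} P\left( I_j(\hat{\beta}^M_j)^{1/2}\, |\hat{\beta}^M_j| \ge \delta \right).$$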
With the given conditions, by Theorem 1, we have $\beta^M_j = 0$. Since $\mathbf{X}_{\mathcal{C}}$ includes the intercept term, Ecj = 0. It is known that $I_j(\hat{\beta}^M_j)^{1/2} \hat{\beta}^M_j$ (for $j \in (\mathcal{M}_{*\mathcal{D}})^c$) has an asymptotically standard normal distribution (Gao et al., 2008; Heyde, 1997). Then, it follows that for some constant $c_7 > 0$,

$$\sup_z \left| P\left( I_j(\hat{\beta}^M_j)^{1/2} \hat{\beta}^M_j \le z \right) - \Phi(z) \right| \le c_7 n^{-1/2}.$$

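Combining both equations, we obtain

$$\mathrm{E}\left( \frac{|\hat{\mathcal{M}}_{\mathcal{D},\delta} \cap (\mathcal{M}_{*\mathcal{D}})^c|}{|(\mathcal{M}_{*\mathcal{D}})^c|} \right) \le 2\{1 - \Phi(\delta)\} + 2 c_7 n^{-1/2}.$$

Setting $\delta = \Phi^{-1}(1 - f/(2d))$ gives the result.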



□ 
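
The role of the normal quantile in this argument can be illustrated numerically. If a standardized statistic were exactly standard normal under the null, the probability that its absolute value reaches $\delta$ would be $2\{1 - \Phi(\delta)\}$, so a target exceedance rate translates into a threshold through the inverse normal distribution function. The sketch below assumes exact normality (the proof only has it approximately, up to an $O(n^{-1/2})$ error), and the function names and the rate `alpha` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def two_sided_threshold(alpha):
    """Threshold delta with 2 * (1 - Phi(delta)) = alpha, for 0 < alpha < 1."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)

def null_exceedance(delta):
    """P(|T| >= delta) for an exactly standard normal statistic T."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(delta))

delta = two_sided_threshold(0.05)
print(round(delta, 4), round(null_exceedance(delta), 4))
```

Lowering the target rate raises the threshold, so fewer inactive variables pass the screen at the cost of a stricter bar for the active ones.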